Glossary · Term

policy gradient

Definition

Plain language

A reinforcement-learning method that nudges a model's behavior toward actions that scored higher.

As stated in the literature

A family of RL algorithms that estimate gradients of expected return with respect to policy parameters by sampling trajectories.

Also called: policy-gradient

Why it matters: It's the foundation under PPO, GRPO, and nearly every modern RL post-training method for LLMs.

For example, after a successful rollout, the policy is nudged to make the actions it took slightly more likely next time.

“So you can reward or penalize that action directly with standard policy-gradient reinforcement learning.”

GRPO

gradient parameter policy reinforcement learning trajectory