Glossary · Term

PPO

Definition

Plain language

A widely used reinforcement-learning algorithm for fine-tuning AI assistants. It improves the model in small, controlled steps rather than large jumps.

As stated in the literature

Proximal Policy Optimization, a clipped-objective policy-gradient method that underlies many modern RLHF pipelines.

Why it matters: Its stability helped make large-scale RLHF practical, and many production assistants have been tuned with PPO or a close variant.

For example, PPO clips the probability ratio between the new and old policy to a narrow range (commonly 1 ± 0.2), so a single training step gains nothing by pushing the new policy far from the old one.
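
The clipping idea can be sketched in a few lines of plain Python. This is a minimal illustration of the per-sample clipped surrogate objective, not code from any particular RLHF pipeline. The function name `clipped_surrogate`, its parameter names, and the default `epsilon=0.2` are assumptions chosen for the example; `ratio` stands for the new policy's probability of an action divided by the old policy's probability of the same action.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective (illustrative sketch).

    ratio     -- pi_new(a|s) / pi_old(a|s), assumed precomputed
    advantage -- estimated advantage of the action
    epsilon   -- clip range; 0.2 is a common choice, assumed here
    """
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) whose probability rose by 50%:
# the objective is capped as if the ratio were only 1 + epsilon.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)

# A bad action (negative advantage) whose probability rose by 50%:
# the penalty is not clipped, so the mistake is still fully discouraged.
penalized = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

Taking the minimum of the two terms is what makes the objective pessimistic: the policy gets no extra credit for moving beyond the clip range in a helpful direction, but is still fully penalized for moving in a harmful one. In training, this quantity is averaged over a batch and maximized.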

“They use PPO — an actual, standard reinforcement learning algorithm.”

Policy Gradient

policy gradient RLHF