
PPO


Definition

A widely used reinforcement-learning algorithm for fine-tuning language-model assistants.

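Proximal Policy Optimization: a policy-gradient method that maximizes a clipped surrogate objective. Clipping the ratio between the new and old policy's probabilities limits how far a single update can move the policy, which keeps training stable. PPO underlies many modern RLHF (reinforcement learning from human feedback) pipelines.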
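A minimal sketch of the clipped surrogate loss, in plain Python for illustration. Let r be the ratio of new to old policy probabilities for an action (in RLHF, a generated token) and A its advantage estimate. Each term is then min(r·A, clip(r, 1 − ε, 1 + ε)·A), and the loss is the negative mean of those terms. The function name `ppo_clip_loss` and the default ε = 0.2 are illustrative assumptions. Real pipelines typically also add a value-function loss, an entropy bonus, or a KL penalty toward a reference model, and all of these are omitted here.

```python
import math
from typing import Sequence


def ppo_clip_loss(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # illustrative default, not prescribed by this entry
) -> float:
    """Hypothetical helper: negative mean of PPO's clipped surrogate objective."""
    if not (len(new_logprobs) == len(old_logprobs) == len(advantages)):
        raise ValueError("inputs must have equal length")
    if not advantages:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the trust interval [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the surrogate objective.
    return -total / len(advantages)
```

For a token whose probability doubled (r = 2), a positive advantage yields an objective capped at 1.2·A. A negative advantage keeps the unclipped 2·A, so clipping removes the incentive to push further but never hides a penalty.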

Mentioned in 3 episodes

  1. 026
    What RL Actually Does to Language Models, at the Token Level
  2. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
  3. 010
    When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL

Related concepts