Definition
The standard reinforcement-learning algorithm used to fine-tune most modern AI assistants.
Proximal Policy Optimization, a clipped-objective policy-gradient method that underlies many modern RLHF pipelines.
The standard reinforcement-learning algorithm used to fine-tune most modern AI assistants.
Proximal Policy Optimization, a clipped-objective policy-gradient method that underlies many modern RLHF pipelines.