REINFORCE · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A classic reinforcement-learning rule where you push the model toward whatever it did when it won, and away from what it did when it lost.

As stated in the literature

A Monte Carlo policy-gradient algorithm that updates the policy parameters along the gradient of the log-probability of each sampled action, weighted by the return that followed it.
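One common way of writing this down, as a sketch rather than the notation of any specific paper: the symbols below (policy \pi_\theta, step size \alpha, undiscounted return G_t, episode length T) are assumptions introduced here for illustration.

```latex
% Undiscounted return from step t to the end of an episode of length T
G_t = \sum_{k=t}^{T-1} r_{k+1}

% Policy-gradient estimate that REINFORCE samples from
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \right]

% Per-step stochastic update applied after each sampled episode
\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

A positive return raises the log-probability of the action that was taken and a negative return lowers it, which is the "toward what won, away from what lost" behavior described in the plain-language definition.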

Why it matters: It's the conceptual ancestor of nearly every policy-gradient method used in RL post-training, and understanding it makes the more elaborate algorithms easier to reason about.

For example, after the model produces an answer that turns out to be correct, training nudges it to be slightly more likely to produce those same tokens next time in the same situation.
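The nudge can be made concrete with a minimal sketch. It assumes a toy setting that is not from the episode: a softmax policy over two actions, where action 1 stands in for a "correct" answer (reward +1) and action 0 for a wrong one (reward -1). The function names (`softmax`, `reinforce_step`, `train`) and the hyperparameters are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr):
    """Apply one REINFORCE update to the logits of a softmax policy.

    For a softmax policy, the gradient of log pi(action) with respect to
    logit k is (1 if k == action else 0) - pi(k).
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def train(reward_fn, n_actions=2, steps=2000, lr=0.1, seed=0):
    """Sample an action, observe its reward, update; repeat for `steps`."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action)
        logits = reinforce_step(logits, action, reward, lr)
    return softmax(logits)


# Assumed toy task: action 1 is rewarded, action 0 is penalized.
final_probs = train(lambda a: 1.0 if a == 1 else -1.0)
print(final_probs)  # probability mass should end up concentrated on action 1
```

In this toy case every update moves probability toward action 1, whichever action was sampled. Real post-training setups differ in scale and usually subtract a baseline from the reward to reduce variance, which this sketch leaves out.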

Heard on the show

“The trick is called REINFORCE, or the policy gradient.”

Episode 060 — When Splitting One Model Across Three Agents Doubles Its Accuracy

Related concepts

Policy Gradient

Related terms

Monte Carlo simulation

Parameter

Policy gradient