DAPO · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A reinforcement-learning method that compares several attempts at the same task to figure out which ones to reinforce.

As stated in the literature

Decoupled Clip and Dynamic sAmpling Policy Optimization, a GRPO-family RL algorithm used as the optimizer in MaR-style metacognitive reward training.

Why it matters: GRPO-family methods like DAPO have become a standard recipe for RL training of reasoning and agent models without the cost of training a separate critic.

For example, a DAPO-style trainer has the model produce eight answers to the same math problem, scores each one with a grader, and reinforces the answers that score above the group's average while discouraging those that score below it.
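The group comparison in that example can be sketched in a few lines. This is a minimal illustration of the group-relative advantage shared by GRPO-family methods, not DAPO's full training objective. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the eight grader scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each answer relative to the other answers to the same problem.

    Hypothetical helper: subtracts the group mean and divides by the group
    standard deviation (population form assumed here), so no separate
    critic model is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every answer gets the same score.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight made-up grader scores for one math problem (1.0 = correct, 0.0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Answers with a positive advantage are the ones that get reinforced, and those with a negative advantage are pushed down. If all eight answers receive the same score, every advantage is zero and that problem contributes no learning signal.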

Heard on the show

“On a math training dataset called DAPO with a seven billion parameter model, their method at eight samples per problem matches vanilla GRPO at sixteen samples per problem.”

Episode 081 — When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence

Mentioned in 2 episodes

Related terms

GRPO, MaR, metacognition, reinforcement learning