LUFFY · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A mixed-policy training method that blends expert demonstrations with the model's own attempts in a single RL loop.

As stated in the literature

A mixed-policy post-training method that interleaves teacher demonstrations with policy rollouts inside a unified RL stage, used as a baseline in math-reasoning post-training comparisons.

Why it matters: Mixing demonstrations into RL can keep a model from getting stuck learning only from its own narrow attempts, especially in hard reasoning domains.

For example, during training, half the rollouts in a batch come from following a teacher trace on a math problem and half come from the student's own attempts, with rewards applied uniformly.
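The half-and-half batch in that example can be sketched in a few lines. This is a minimal illustration of the general idea only, not LUFFY's published algorithm: the names `Rollout`, `build_mixed_batch`, and `score` are hypothetical, the student policy is any callable you supply, and "rewards applied uniformly" is read here as one exact-match rule used for every rollout regardless of where it came from.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    problem: str
    answer: str
    source: str  # "teacher" or "student"


def build_mixed_batch(problems, teacher_answers, student_policy, batch_size, rng):
    """Fill the first half of the batch with teacher traces and the
    second half with the student's own attempts (hypothetical helper)."""
    half = batch_size // 2
    batch = []
    for i in range(batch_size):
        problem = problems[i % len(problems)]
        if i < half:
            # Off-policy sample: follow the teacher's demonstration.
            batch.append(Rollout(problem, teacher_answers[problem], "teacher"))
        else:
            # On-policy sample: let the student try the problem itself.
            batch.append(Rollout(problem, student_policy(problem, rng), "student"))
    return batch


def score(rollout, reference_answers):
    """Same reward rule for every rollout, whatever its source."""
    return 1.0 if rollout.answer == reference_answers[rollout.problem] else 0.0


if __name__ == "__main__":
    reference = {"2+2": "4", "3*3": "9"}

    def toy_student(problem, rng):
        # Stand-in for sampling from a real model (assumption).
        return rng.choice(["4", "9", "0"])

    rng = random.Random(0)
    batch = build_mixed_batch(list(reference), reference, toy_student, 4, rng)
    for rollout in batch:
        print(rollout.source, rollout.problem, rollout.answer, score(rollout, reference))
```

In a real training loop the scored rollouts would feed a policy-gradient update; that step is left out here because the entry does not describe it.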

Heard on the show

“LUFFY, ReLIFT, SRFT, Prefix-RFT, HPT — methods that said: don't separate the stages, blend them.”

Episode 009 — How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Mentioned in 1 episode

009 · How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Related terms

policy · post-training · reinforcement learning · rollout