Prefix-RFT · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A mixed-policy reasoning post-training method.

As stated in the literature

A reinforcement fine-tuning (RFT) variant that interleaves expert-prefix demonstrations with on-policy rollouts; one of several mixed-policy baselines whose claimed gains over SFT-then-RL turned out to be artifacts of training-pipeline bugs.

Why it matters: Mixed-policy methods are a popular alternative to plain SFT-then-RL, but their reported gains can only be trusted when the training pipeline, including the SFT-then-RL baseline, is implemented correctly.

For example, each rollout begins from a partial expert demonstration before the model continues on its own and is graded.
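
The rollout described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the published Prefix-RFT implementation: the function names (prefix_rollout, toy_policy, toy_grader), the uniform choice of prefix length, and the 0/1 reward are all hypothetical, and the sketch says nothing about how the real method weights prefix tokens in the loss.

```python
import random
from typing import Callable, List, Tuple

Tokens = List[str]

def prefix_rollout(
    expert_tokens: Tokens,
    continue_fn: Callable[[Tokens], Tokens],
    grade_fn: Callable[[Tokens], float],
    rng: random.Random,
) -> Tuple[Tokens, int, float]:
    """Start from a random-length expert prefix, let the policy finish, then grade.

    Assumption: the prefix length is drawn uniformly; the source does not
    specify the sampling schedule.
    """
    cut = int(len(expert_tokens) * rng.uniform(0.0, 1.0))
    prefix = expert_tokens[:cut]          # off-policy part (expert demonstration)
    continuation = continue_fn(prefix)    # on-policy part (model's own tokens)
    trajectory = prefix + continuation
    return trajectory, cut, grade_fn(trajectory)

# Hypothetical stand-ins for a language model and a verifier.
EXPERT: Tokens = ["2", "+", "2", "=", "4"]

def toy_policy(prefix: Tokens) -> Tokens:
    # Pretend model: completes the remaining tokens of the arithmetic line.
    return EXPERT[len(prefix):]

def toy_grader(trajectory: Tokens) -> float:
    # Reward 1.0 if the final token is the correct answer, else 0.0.
    return 1.0 if trajectory and trajectory[-1] == "4" else 0.0

trajectory, cut, reward = prefix_rollout(EXPERT, toy_policy, toy_grader, random.Random(0))
print(f"prefix length={cut}, trajectory={trajectory}, reward={reward}")
```

In a real training loop the reward would feed a policy-gradient update, and the split point returned as cut would mark which tokens came from the expert and which from the model.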

Heard on the show

“LUFFY, ReLIFT, SRFT, Prefix-RFT, HPT — methods that said: don't separate the stages, blend them.”

Episode 009 — How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Mentioned in 1 episode

009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Related terms

pipelining · policy · reinforcement learning · rollout · SFT · trajectory