SRFT · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A mixed-policy reasoning post-training method.

As stated in the literature

Supervised Reinforcement Fine-Tuning, a mixed-policy post-training method that blends supervised data with on-policy data. It is used as a baseline in math-reasoning post-training comparisons, where headline gains turned out to depend on infrastructure bugs.

Why it matters: Mixed-policy methods like SRFT looked like a clear win until follow-up work showed that their gains depended on infrastructure quirks. It is a reminder that headline improvements in this field often hide subtle bugs.

For example, during post-training the model is updated on a blend of expert-curated solutions and its own attempts, mixing imitation with reinforcement.
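The blend described above can be sketched as a single loss that mixes an imitation term with a reinforcement term. This is a toy illustration, not the objective from the SRFT paper: the function names, the fixed `sft_weight`, and the mean-reward baseline are all assumptions chosen for clarity, and the log-probabilities are plain numbers rather than model outputs.

```python
import math

def sft_loss(expert_logprobs):
    """Imitation term: mean negative log-likelihood the model assigns
    to tokens from expert-curated solutions."""
    return -sum(expert_logprobs) / len(expert_logprobs)

def policy_gradient_loss(rollout_logprobs, rewards):
    """Reinforcement term: REINFORCE-style surrogate over the model's own
    attempts, using the mean reward as a baseline (an assumption here)."""
    baseline = sum(rewards) / len(rewards)
    terms = [-(r - baseline) * lp for lp, r in zip(rollout_logprobs, rewards)]
    return sum(terms) / len(terms)

def mixed_policy_loss(expert_logprobs, rollout_logprobs, rewards, sft_weight=0.5):
    """Hypothetical blend: a fixed convex combination of the two terms.
    Real mixed-policy methods may weight or schedule them differently."""
    imitation = sft_loss(expert_logprobs)
    reinforcement = policy_gradient_loss(rollout_logprobs, rewards)
    return sft_weight * imitation + (1.0 - sft_weight) * reinforcement

# Example inputs: two expert tokens, two sampled attempts (one correct).
expert_logprobs = [math.log(0.5), math.log(0.25)]
rollout_logprobs = [-1.0, -2.0]   # sequence-level log-probs of own attempts
rewards = [1.0, 0.0]              # 1.0 = correct answer, 0.0 = incorrect
loss = mixed_policy_loss(expert_logprobs, rollout_logprobs, rewards)
```

Setting `sft_weight` to 1.0 reduces this to pure imitation, and 0.0 to pure reinforcement; a mixed-policy method sits between the two in a single training stage rather than running them one after the other.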

Heard on the show

“LUFFY, ReLIFT, SRFT, Prefix-RFT, HPT — methods that said: don't separate the stages, blend them.”

Episode 009 — How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Mentioned in 1 episode

009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

Related terms

policy post-training