Concept · 1 episode(s)

Mixed-Policy Training

Definition

Mixed-policy training trains a model on trajectories drawn from more than one policy — the current model, an older checkpoint, a teacher, a human, a sampler — in a controlled mixture. It’s a way to balance exploration, stability, and quality during RL or rejection-sampling training.

Episodes covering this

009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Limozin, Durech, Hoefler et al. · ETH AI Center·23 min·May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

LUFFY: Learning to Reason Under Off-Policy Guidance