Definition
A mixed-policy training method that blends expert demonstrations with the model's own attempts in a single RL loop.
A mixed-policy post-training method that interleaves teacher demonstrations with policy rollouts inside a unified RL stage, used as a baseline in math-reasoning post-training comparisons.