Definition
A mixed-policy reasoning post-training method.
A reinforcement-from-trajectory variant that interleaves expert-prefix demonstrations with on-policy rollouts. It is one of several mixed-policy baselines whose claimed gains over supervised fine-tuning followed by reinforcement learning (SFT-then-RL) turned out to be artifacts of training-pipeline bugs.
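
As a rough illustration of what interleaving expert-prefix demonstrations with on-policy rollouts can look like, the sketch below builds a group of rollouts for one prompt. It is not the method's actual implementation: the names `Rollout`, `mixed_rollouts`, `policy_continue`, and `prefix_fraction`, as well as the even/odd alternation rule, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    prompt: str
    tokens: List[str]   # expert prefix (if any) followed by the policy's continuation
    prefix_len: int     # number of leading tokens copied from the expert (0 = fully on-policy)


def mixed_rollouts(
    prompt: str,
    expert_tokens: Sequence[str],
    policy_continue: Callable[[str, List[str]], List[str]],
    n_rollouts: int,
    prefix_fraction: float = 0.5,
) -> List[Rollout]:
    """Alternate expert-prefixed and fully on-policy rollouts (hypothetical scheme).

    Even-indexed rollouts start from the first `prefix_fraction` of the expert
    trajectory; odd-indexed rollouts are generated by the policy from scratch.
    """
    rollouts = []
    for i in range(n_rollouts):
        # Assumed interleaving rule: even index -> expert prefix, odd index -> none.
        k = int(len(expert_tokens) * prefix_fraction) if i % 2 == 0 else 0
        prefix = list(expert_tokens[:k])
        # The policy completes the response conditioned on the prompt and prefix.
        continuation = policy_continue(prompt, prefix)
        rollouts.append(Rollout(prompt, prefix + continuation, k))
    return rollouts


def toy_policy(prompt: str, prefix: List[str]) -> List[str]:
    """Stand-in for a real model: pads every response to four tokens."""
    return ["<gen>"] * (4 - len(prefix))
```

Recording `prefix_len` on each rollout lets a downstream loss treat the copied expert tokens differently from the policy-generated ones, for example by masking them; how any particular method handles that is not specified here.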