Definition
Mixed-policy training trains a model on trajectories drawn from more than one policy — the current model, an older checkpoint, a teacher, a human, a sampler — in a controlled mixture. It’s a way to balance exploration, stability, and quality during RL or rejection-sampling training.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.