Definition
A mixed-policy reasoning post-training method.
A reinforcement-learning-from-trajectory post-training method that mixes expert demonstrations with on-policy rollouts; used as a comparison baseline in math-reasoning post-training studies.
A mixed-policy reasoning post-training method.
A reinforcement-learning-from-trajectory post-training method that mixes expert demonstrations with on-policy rollouts; used as a comparison baseline in math-reasoning post-training studies.