Definition
An open RL training pipeline for reasoning models used as a comparison baseline.
An open-source GRPO-based RL post-training recipe for reasoning models on math problems, used as a cost baseline against ReasonMaxxer.