Definition
A published RL pipeline used as a high-cost baseline for reasoning model training.
An open recipe for large-scale PPO training of reasoning models, used as a cost-and-quality reference point in the ReasonMaxxer comparisons (estimated cost: about $103K to train a 32B reasoning model).