Definition
Reinforcement learning for reasoning trains models to produce useful chains of thought by rewarding correct final answers (or verified intermediate steps) and letting the model figure out the reasoning that gets there. Most of the 2024–2026 jump in math and code performance has roots here.
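To make the "reward the final answer" idea concrete, here is a minimal sketch in Python. It is not any particular lab's training recipe: the `Answer:` marker format, the function names, and the mean-centered baseline are all assumptions chosen for illustration. It shows only how sampled chains of thought could be scored by their final answer and turned into a per-sample learning signal; the policy update itself is left out.

```python
import re
from statistics import mean
from typing import List, Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer matches the reference exactly, else 0.0.

    The reasoning text is never graded; only the final answer is checked.
    """
    return 1.0 if extract_final_answer(completion) == reference else 0.0


def centered_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group's mean reward so better-than-average samples get a
    positive signal and worse-than-average samples get a negative one."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four hypothetical samples for the prompt "What is 17 + 25?"
samples = [
    "17 + 20 = 37, then 37 + 5 = 42. Answer: 42",
    "17 + 25 is about 40. Answer: 40",
    "Add the tens to get 30, then the ones to get 12. Answer: 42",
    "I am not sure.",
]
rewards = [outcome_reward(s, "42") for s in samples]
advantages = centered_advantages(rewards)

for sample, adv in zip(samples, advantages):
    print(f"{adv:+.1f}  {sample}")
```

In a real training loop, a policy-gradient step would then raise the probability of the tokens in positively scored samples and lower it for the rest, which is how the model ends up discovering reasoning that leads to correct answers without being shown that reasoning directly.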
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.