Theme · 14 episodes

RL for Reasoning

Definition

Reinforcement learning for reasoning trains models to produce useful chains of thought by rewarding correct final answers (or verified intermediate steps) and letting the model figure out the reasoning that gets there. Most of the 2024–2026 jump in math and code performance has roots here.
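As a rough illustration of "rewarding correct final answers", here is a minimal sketch of an outcome reward with a mean baseline. It is not taken from any of the papers below. The "Answer:" marker, the function names, and the sample completions are all assumptions made up for this example, and real systems use more careful answer checking and a full policy-gradient update.

```python
import re
from statistics import mean

# Assumption: completions end with a line like "Answer: 12".
ANSWER_PATTERN = re.compile(r"Answer:\s*(.+)")

def outcome_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last stated answer matches the reference, else 0.0."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return 0.0  # no parsable final answer, so no reward
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0

def baseline_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean so above-average samples get a positive signal."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical completions sampled for one prompt: "What is (2 + 2) * 3?"
samples = [
    "2 + 2 is 4, times 3 is 12.\nAnswer: 12",
    "2 + 2 is 5, times 3 is 15.\nAnswer: 15",
    "I am not sure.",
]
rewards = [outcome_reward(s, "12") for s in samples]
advantages = baseline_advantages(rewards)
```

Only the final answer is scored here. The chain of thought that precedes it is never graded directly, which is the sense in which the model is left to work out the reasoning on its own.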

Episodes covering this

187
An 8-Billion Agent That Beats Models 80 Times Its Size By Looking Things Up
An AI agent for treatment reasoning over a biomedical tool universe
Gao, Noori, Zhu et al. · Department of Biomedical Informatics·19 min·Jun 30, 2026
165
A Free-Lunch Tweak That Lets a Tiny Agent Beat Frontier Giants
Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning
Wang, Song, Zhang et al. · Peking University·22 min·Jun 23, 2026
163
Why Training Only on Perfect Solutions Cripples a Model's Reasoning
Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
Wei, Kim · Princeton University·22 min·Jun 23, 2026
141
How Two Tokens Reopened a Reasoning Method the Field Had Given Up On
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Yang, Chen, Wu et al. · HKUST(GZ)·29 min·Jun 12, 2026
133
How MiniMax Turned a Reward-Hacking Disaster Into Olympiad Gold
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Chen, Zhang, Zhang et al. · MiniMax / The Chinese University of Hong Kong·34 min·Jun 12, 2026
101
Treating Math Formalization Like a Codebase, and Where the Agents Cheat
Formalizing Mathematics at Scale
Rammal, Patel, Gloeckle et al. · FAIR at Meta / CERMICS·27 min·May 29, 2026
084
Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away
ECHO: Terminal Agents Learn World Models for Free
Shrivastava, Kauffmann, Awadallah et al. · Microsoft Research·26 min·May 26, 2026
081
When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Gai, Zeng, Baek et al. · Carnegie Mellon University·25 min·May 26, 2026
079
An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University·29 min·May 25, 2026
073
When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
Multi-LLM Systems Exhibit Robust Semantic Collapse
Kong, Lai, Piao et al. · University of Toronto·28 min·May 23, 2026
048
How a 30B Open Model Reached Olympiad Gold With the Right Recipe
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Li, Zhan, Zhang et al. · Shanghai AI Laboratory / The Chinese University of Hong Kong·31 min·May 16, 2026
041
When the Iteration Teaches the Model to Skip the Iteration
Solve the Loop: Attractor Models for Language and Reasoning
Fein-Ashley, Rashidinejad · University of Southern California·30 min·May 13, 2026
026
What RL Actually Does to Language Models, at the Token Level
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
Akgül, Kannan, Neiswanger et al. · University of Southern California·24 min·May 08, 2026
009
How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Limozin, Durech, Hoefler et al. · ETH AI Center·23 min·May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.