Concept · 6 episodes

Process Reward Models

Definition

Process reward models score each step of a reasoning trajectory rather than just the final answer, giving denser feedback for training and search. They are harder to build than outcome reward models because they need step-level labels, but their per-step scores can guide reasoning-time search, for example by ranking or pruning partial solutions before a final answer exists.
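
To make the contrast concrete, here is a minimal sketch of step-level scoring and a best-of-N selection built on it. Nothing in it comes from the episodes listed below: `toy_scorer` is a hand-written stand-in for a learned model, the function names are hypothetical, and the two aggregation rules (minimum and product of step scores) are just two common ways to turn step scores into a single trajectory score.

```python
from math import prod
from typing import Callable, List, Sequence

# Assumed representation: a trajectory is a list of reasoning steps (strings).
Trajectory = List[str]
# Hypothetical scorer signature: (previous steps, current step) -> score in [0, 1].
StepScorer = Callable[[Sequence[str], str], float]


def score_steps(trajectory: Trajectory, scorer: StepScorer) -> List[float]:
    """Process-style scoring: one score per step, conditioned on earlier steps."""
    return [scorer(trajectory[:i], step) for i, step in enumerate(trajectory)]


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse step scores into one trajectory score."""
    if not step_scores:
        raise ValueError("trajectory has no steps")
    if how == "min":
        return min(step_scores)   # a chain is as good as its weakest step
    if how == "prod":
        return prod(step_scores)  # treats scores like independent probabilities
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: Sequence[Trajectory], scorer: StepScorer,
              how: str = "min") -> Trajectory:
    """Pick the candidate whose aggregated step score is highest."""
    return max(candidates, key=lambda t: aggregate(score_steps(t, scorer), how))


def toy_scorer(previous: Sequence[str], step: str) -> float:
    """Stand-in for a learned model: penalizes one known-bad step."""
    return 0.1 if "2 + 2 = 5" in step else 0.9


if __name__ == "__main__":
    wrong = ["2 + 2 = 5", "5 * 3 = 15"]
    right = ["2 + 2 = 4", "4 * 3 = 12"]
    print(score_steps(wrong, toy_scorer))          # per-step feedback
    print(best_of_n([wrong, right], toy_scorer))   # selected trajectory
```

An outcome reward model would return a single number for each full trajectory. The per-step list is what lets a search procedure see which step went wrong and discard a partial solution early.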

Episodes covering this

183
Why You Can't Fine-Tune Foresight Into an AI Agent
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Zhang, Zhou, Qiao et al. · Fudan University / Shanghai Innovation Institute / Tencent Youtu Lab · 23 min · Jun 29, 2026
173
The Free Step-Level Grader Hiding in Every RL Training Run
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Oh, Li, Park et al. · University of Wisconsin–Madison · 22 min · Jun 25, 2026
170
When a One-Liner Beats Your Agent's Clever Verification Logic
Bayesian control for coding agents
Papamarkou, Smirnov, Mazanov et al. · PolyShape / National Technical University of Athens · 26 min · Jun 24, 2026
081
When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Gai, Zeng, Baek et al. · Carnegie Mellon University · 25 min · May 26, 2026
079
An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University · 29 min · May 25, 2026
008
Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Wang, Gooding, Hartmann et al. · Google DeepMind · 24 min · May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Let's Verify Step by Step