An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
About this episode
A new paper takes John Flavell's 1979 theory of metacognition and turns it into a reward signal for reinforcement learning — and the result is a 9-billion-parameter model that beats frontier models more than ten times its size on reasoning benchmarks. The bigger surprise is buried in the ablation: process rewards may be doing more work than final-answer correctness, inverting an assumption the field has quietly relied on for years.
What you'll take away
- Why outcome-only rewards (RLVR) can actively degrade reasoning quality, even as final answers improve
- How Flavell's distinction between metacognitive knowledge and regulation gets operationalized as three concrete reward components
- The ablation result that flips conventional wisdom: removing process rewards hurts performance more than removing the correctness reward
- Strong out-of-domain generalization to math and long-context tasks the model never trained on — and what that suggests about transferable reasoning habits
- The load-bearing concern: the entire reward signal is generated by other LLMs, raising questions about whether models are learning real metacognition or just performing its format
- Why the most dramatic benchmark gains come from evaluations that are structurally friendly to the method
Chapters
- 00:00 The trap between RLVR and rubrics-as-rewards
- 03:11 Flavell's metacognition, brought into reward design
- 06:22 The structured output format and the five-number reward
- 09:33 Design choices: recovery, multiplicative penalties, and faithfulness
- 12:44 Headline results and the small-model-beats-big-model claim
- 15:55 The ablation that challenges the field's hierarchy
- 19:06 Out-of-domain transfer to math and long-context tasks
- 22:18 The grader-dependency critique
- 25:29 What survives the critique, and what the paper changes
References in this episode
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — The canonical RLVR recipe the episode positions MaR against — useful for underst
- Measuring Faithfulness in Chain-of-Thought Reasoning — Lanham et al.'s work on whether reasoning traces actually drive model answers —
- Let's Verify Step by Step — OpenAI's process reward model paper, a key precursor in the 'supervise the traje
Full transcript
0:00Juniper: Think about the difference between two students taking a hard chemistry exam. One gets the right answer because she wrote down what she knew, listed the formulas she needed, planned out her steps, and worked it through. The other gets the right answer because she vaguely remembered something like it from homework and guessed well. On the answer sheet, they look identical. Same grade. Same checkmark. But you and I both know — only one of them is going to do well on the next problem.
0:31Tyler: Right. And for the last couple of years, the standard way we've been training language models to reason has relied on grading that looks a lot more like the answer sheet than the teacher's judgment. The paper we're digging into today asks a pretty pointed question: what if the entire reason small reasoning models keep plateauing is that we're rewarding them like the second student? It's a paper called "Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals," it went up on arXiv on May twenty-second, twenty-twenty-six, and we're recording three days later. Quick ground rules: this episode is AI-generated, the script is from Anthropic's Claude Opus 4.7, I'm Tyler and you just heard Juniper, and we're both AI voices from ElevenLabs. Neither company is involved in producing the show.
1:23Juniper: And the reason that three-day gap matters is that this paper sits right at a tension the field has been quietly carrying for a while — and the move it makes to resolve that tension borrows from a piece of cognitive psychology from nineteen seventy-nine. That's where I want to start, because if you don't have the conceptual frame, the rest of it just sounds like another RL tweak.
1:47Tyler: So let's set up the tension first, because the contribution only really lands if you see the trap the field is in. When people talk about training a reasoning model with reinforcement learning right now, they're almost always talking about one of two things. The first, and by far the dominant one, is what gets called RLVR — reinforcement learning with verifiable rewards. This is the DeepSeek-R1 recipe, basically. You give the model a math problem, it generates a long chain of reasoning, and at the end you check: is the final number right? If yes, reward. If no, no reward.
2:25Juniper: It's clean. It's automatic. It scales beautifully because you don't need humans in the loop — a calculator can grade the math, a compiler can grade the code. But the whole signal is collapsed to one bit. Everything that happens before the final answer is, in training terms, almost completely unsupervised. The model can stumble through a sloppy, lucky, or even logically broken chain of reasoning and still get full credit, as long as that last token is right. It's the student-who-guessed problem, but at industrial scale.
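A minimal sketch of the one-bit outcome reward described above, assuming exact string match as the verifier (real pipelines typically use a math checker or a test harness instead). The function name and the two example rollouts are hypothetical and only illustrate that the reasoning text never enters the reward.

```python
def outcome_reward(predicted_answer: str, reference_answer: str) -> float:
    """Return 1.0 when the final answer matches the reference, else 0.0.

    Assumption: exact string match stands in for a real verifier.
    """
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


# Two hypothetical rollouts with very different reasoning quality.
careful = {"reasoning": "listed formulas, planned steps, worked it through", "answer": "42"}
lucky = {"reasoning": "vaguely remembered something similar and guessed", "answer": "42"}

# Only the final answer is inspected, so both rollouts earn the same reward.
rewards = [outcome_reward(rollout["answer"], "42") for rollout in (careful, lucky)]
```

Because `reasoning` is never read, the reward cannot distinguish the two students from the opening example.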
3:00Tyler: And the alternative everyone reaches for is what's called rubrics-as-rewards. Instead of grading the answer, you write a rubric — a natural-language description of what good reasoning on this specific problem looks like — and have a grader score the rollout against that rubric. Now you're rewarding the process. But here's the catch: you basically have to write a rubric for every single problem, or every type of problem. They're bespoke. They're expensive. They don't generalize. You're back to needing human effort at scale, which is the thing RL was supposed to get us out of.
3:38Juniper: So the field has been stuck between coarse-but-cheap on one side and rich-but-expensive on the other. And the authors come along and ask what feels, in retrospect, like the obvious question. Is there a way to grade the process — not the answer — but using dimensions that are general enough to apply to any reasoning task? Something where the rubric doesn't change between a chemistry problem and a medical diagnosis and a math proof.
4:07Tyler: And their answer comes from a cognitive psychologist named John Flavell, writing in nineteen seventy-nine.
4:15Juniper: Yeah, and this is the part of the paper I find most interesting on its own. Flavell coined the term metacognition. The literal gloss is "cognition about cognition" — thinking about your own thinking. And the classic breakdown, which has held up across four-plus decades of psychology research, splits metacognition into two pieces. The first is metacognitive knowledge: your awareness of what's relevant to a problem and what you don't yet know. "This is a stoichiometry question. I'll need the molar mass of oxygen. I remember that." The second is metacognitive regulation: actively managing your own reasoning. Making a plan. Checking your work. Noticing when you're stuck. Backing up.
5:02Tyler: And the thing that makes this useful as a training signal, rather than just a nice frame, is that these two dimensions are domain-general by construction. "Did you identify the relevant knowledge?" is a question you can ask about a chemistry problem or a logic puzzle or a medical case. "Did you actually follow the plan you laid out?" — same thing. Universal.
5:26Juniper: Right. And that's the bet. The authors are betting that those two dimensions, plus the standard final-answer check, give you a reward signal that's process-aware like rubrics but domain-general like RLVR. The third way the field has been missing.
5:43Tyler: Okay. So how do they actually operationalize this? Because "reward metacognition" sounds nice as a slogan, but you have to turn it into something a machine can score.
5:54Juniper: So this is where the engineering kicks in, and it's elegant. They force every single model response into a specific structure. Imagine a packing list before a hiking trip. Before you go, you write down what gear you'll need. Then you write the day's itinerary — where you'll start, where you'll turn, where you'll camp. Then you actually hike, and at the end you've either made it back or you haven't. A thorough grader would check all three: did you pack the right things, did you follow the itinerary, and did you come back safe?
6:29Tyler: And critically — if you came back fine but had forgotten your map and just got lucky on the trail, the grader still wants to know. That's the bit outcome-only rewards miss.
6:40Juniper: Exactly. So the model's output is structured the same way. It opens with a section called metacognitive knowledge — an enumerated list of the facts, definitions, constraints, and rules that this problem requires. Atomic bullets, one per item. Then a section called metacognitive regulation — a short executable plan, the steps it intends to take. Then it solves the problem. There's an optional middle section called lookback — if the model realizes partway through that it missed something it needed, it explicitly says so and goes back to fetch it. And finally the answer.
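The structure described above can be sketched as a small parser. The `## name` header syntax, the section names, and the helper functions are all assumptions made for illustration; the episode does not say how the paper actually delimits sections.

```python
import re

# Hypothetical section names; the real delimiters are not given in the episode.
SECTION_NAMES = ["knowledge", "regulation", "solution", "lookback", "answer"]


def parse_rollout(text: str) -> dict[str, str]:
    """Split a rollout into sections marked by assumed '## name' headers."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        match = re.fullmatch(r"##\s*(\w+)\s*", line)
        if match and match.group(1).lower() in SECTION_NAMES:
            current = match.group(1).lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def knowledge_items(section: str) -> list[str]:
    """Return the atomic bullets ('- item') from the knowledge section."""
    return [line[2:].strip() for line in section.splitlines() if line.startswith("- ")]


example = """## knowledge
- molar mass of O2 is about 32 g/mol
- moles = mass / molar mass
## regulation
1. convert mass to moles
2. report the result
## solution
64 g / 32 g/mol = 2 mol
## answer
2 mol"""

parsed = parse_rollout(example)
items = knowledge_items(parsed["knowledge"])
```

With one bullet per item, counting how many required items appear becomes a list operation rather than a judgment call over a block of prose. The optional lookback section is simply absent from `parsed` when the model did not use it.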
7:17Tyler: The atomic-bullet format isn't cosmetic, by the way. It matters because it makes coverage measurable. If the model lists eight knowledge items as one block of prose, you can't easily count how many of the things a good answer required actually showed up. If it lists them as bullets, you can.
7:36Juniper: And then a grader — another large language model, in this case — reads the rollout and produces five numbers. How many of the gold-standard knowledge items appeared in the initial list. How many initially-missed items the model recovered through the lookback step. An alignment score for how well the actual reasoning matched the stated plan. A binary flag for whether the model took a shortcut and jumped to the answer without doing what it said it would do. And the final correctness check.
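One way to picture the grader's output is as a small record with five fields. The field names, the assumed range for the alignment score, and the validation rules below are illustrative choices, not the paper's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderScores:
    """The five grader outputs described above; field names are illustrative."""

    initial_hits: int    # gold knowledge items present in the initial list
    recovered_hits: int  # initially missed items recovered via lookback
    alignment: float     # plan-vs-execution alignment, assumed to lie in [0, 1]
    shortcut: bool       # True if the model jumped to the answer, ignoring its plan
    correct: bool        # final-answer correctness check

    def validate(self, gold_total: int) -> None:
        """Raise ValueError if the scores are inconsistent with the gold list size."""
        if self.initial_hits < 0 or self.recovered_hits < 0:
            raise ValueError("hit counts must be non-negative")
        if self.initial_hits + self.recovered_hits > gold_total:
            raise ValueError("cannot cover more items than the gold list contains")
        if not 0.0 <= self.alignment <= 1.0:
            raise ValueError("alignment is assumed to be in [0, 1]")


# Hypothetical scores for a rollout graded against eight gold knowledge items.
scores = GraderScores(initial_hits=5, recovered_hits=2, alignment=0.8,
                      shortcut=False, correct=True)
scores.validate(gold_total=8)
```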
8:07Tyler: Okay, let me push on the math for a second, because the math here is light but it's where the design choices actually live. The three reward components combine those five numbers in ways that say something about what the authors believe matters.
8:24Juniper: Go ahead, Tyler. This is where the design gets opinionated.
8:27Tyler: So the first reward — knowledge monitoring — is basically the fraction of required knowledge items the model identified, counting both the initial list and anything it recovered through lookback. The interesting choice is that recovery counts. The model isn't penalized for initially missing something, as long as it notices the gap before it answers. That encodes a fairly generous view of metacognition: you don't have to know everything upfront, you have to notice when you don't.
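A minimal sketch of the knowledge monitoring reward as described, assuming it is a plain fraction in which recovered items count exactly as much as items listed upfront. The paper's formula may weight the two differently; the function name and arguments are hypothetical.

```python
def knowledge_reward(initial_hits: int, recovered_hits: int, gold_total: int) -> float:
    """Fraction of gold knowledge items covered, counting lookback recoveries.

    Assumption: initial and recovered items carry equal weight.
    """
    if gold_total <= 0:
        raise ValueError("gold_total must be positive")
    covered = initial_hits + recovered_hits
    if covered > gold_total:
        raise ValueError("covered items cannot exceed the gold list size")
    return covered / gold_total


# Under this assumption, a model that lists 5 of 8 items and recovers 2 more
# scores the same as a model that lists 7 of 8 upfront.
with_recovery = knowledge_reward(initial_hits=5, recovered_hits=2, gold_total=8)
upfront_only = knowledge_reward(initial_hits=7, recovered_hits=0, gold_total=8)
```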
8:58Juniper: Which, honestly, is closer to how good human reasoners actually work. Nobody walks into a hard problem with a complete inventory of what they need. The skill is catching the gap mid-solve.
9:10Tyler: The second reward — regulation — is the alignment score multiplied by a penalty if the shortcut flag fires. And this is where I think the authors got something subtle right. The penalty is multiplicative, not additive. If the model writes a beautiful detailed plan and then ignores it and jumps to the answer, the regulation reward gets cut by thirty percent regardless of how good the alignment score was. An additive penalty would be easier to ignore — the model could earn back the points elsewhere. Multiplicative ones create a sharper incentive against gaming the structure.
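The regulation reward can be sketched as follows, assuming the alignment score lies in [0, 1] and the shortcut flag scales it by a fixed factor. The thirty percent cut comes from the discussion above; the constant and function names are hypothetical.

```python
SHORTCUT_CUT = 0.30  # "cut by thirty percent", as described above


def regulation_reward(alignment: float, shortcut: bool) -> float:
    """Alignment score, scaled down multiplicatively when a shortcut is flagged."""
    if not 0.0 <= alignment <= 1.0:
        raise ValueError("alignment is assumed to be in [0, 1]")
    penalty = (1.0 - SHORTCUT_CUT) if shortcut else 1.0
    return alignment * penalty


followed_plan = regulation_reward(alignment=0.8, shortcut=False)
ignored_plan = regulation_reward(alignment=0.8, shortcut=True)
```

Because the penalty is a multiplier, the loss is always the same proportion of the alignment score, so a highly rated plan that gets ignored gives up the most in absolute terms.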
9:48Juniper: And that's directly tackling the chain-of-thought faithfulness problem, right? There's a known issue where the reasoning traces models produce don't actually drive the answer. They look like reasoning, but the model arrived at the conclusion some other way and decorated it with a plausible-looking trace. The shortcut penalty is the authors saying: if you write a plan and don't follow it, we're going to dock you, hard.
10:15Tyler: Whether they actually succeed at enforcing faithfulness or just at producing reasoning that looks faithful to a grader — that's a question I want to come back to. But the design intent is clear and it's pointed in the right direction.
10:30Juniper: And the third reward is just the standard correctness check. Did you get the right answer? One bit. Each of the three components is bounded between zero and one, you add them up, and that's the reward signal. Each one weighted equally. No single dimension dominates. They plug it into a standard RL algorithm and let the policy gradient do its thing.
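Putting the three pieces together, a sketch of the summed reward under the equal-weight description above. The function name and the example component values are made up for illustration.

```python
def total_reward(knowledge: float, regulation: float, correctness: float) -> float:
    """Equal-weight sum of three components, each assumed to lie in [0, 1]."""
    components = {"knowledge": knowledge, "regulation": regulation, "correctness": correctness}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return knowledge + regulation + correctness


# Hypothetical rollouts: a correct answer reached via a shortcut, and a
# well-monitored, well-planned attempt that ends on a wrong answer.
lucky_but_sloppy = total_reward(knowledge=0.5, regulation=0.35, correctness=1.0)
careful_but_wrong = total_reward(knowledge=1.0, regulation=0.9, correctness=0.0)
```

With equal weights the total ranges from zero to three, and correctness alone can contribute at most a third of it.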
10:53Tyler: Worth saying clearly: the optimizer here is off-the-shelf. They use a variant called DAPO — sample several attempts per prompt, compare them within the group, push the model toward the better ones. It's standard machinery. The contribution is entirely in the reward, not in the optimizer. That matters because it means if this idea holds up, it slots into existing training pipelines without rebuilding everything.
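The "compare them within the group" step can be sketched as group-normalized advantages. This shows only that one step, assuming mean-and-standard-deviation normalization within each prompt's group; clipping, sampling strategy, and the policy update itself are omitted, and the function name and reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled attempts."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # Identical rewards give no way to prefer one attempt over another.
        return [0.0 for _ in rewards]
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Four hypothetical attempts at one prompt, scored on the zero-to-three scale.
advantages = group_advantages([1.85, 1.9, 0.6, 2.7])
```

Attempts above the group mean get positive advantages and are reinforced; attempts below it are pushed down.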
11:22Juniper: Okay. So we have the conceptual frame, we have the structure, we have the reward. The real question is: does any of this actually work?
11:32Tyler: And this is where I think the paper earns its keep. The headline result is the kind of thing that, when you first read it, you check twice. They take a nine-billion-parameter model — Qwen3.5-9B — and train it with this metacognition reward. The resulting model hits an average of around sixty-eight percent across a battery of science and medical reasoning benchmarks. That number on its own is interesting. The number it's beating is the part that makes you sit up.
12:04Juniper: Which is what?
12:05Tyler: GPT-OSS-120B. A model roughly thirteen times larger. On those same benchmarks, the nine-billion model with MaR comes out ahead overall. On the GPQA-Diamond benchmark specifically, the 9B-plus-MaR model beats Qwen3.5-397B, a model roughly forty-four times its size. On a long-context medical benchmark called LongHealth, it beats DeepSeek-V3.2, which is six hundred and eighty-five billion parameters.
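The size ratios quoted above are easy to check. This sketch assumes the model-name suffixes and the quoted six hundred and eighty-five billion figure are nominal total parameter counts; the variable names are hypothetical.

```python
POLICY_PARAMS_B = 9  # policy size in billions of parameters

# Comparison model sizes in billions, as quoted in the discussion above.
comparison_params_b = {
    "GPT-OSS-120B": 120,
    "Qwen3.5-397B": 397,
    "DeepSeek-V3.2": 685,
}

size_ratios = {name: size / POLICY_PARAMS_B for name, size in comparison_params_b.items()}

for name, ratio in size_ratios.items():
    print(f"{name}: about {ratio:.1f}x the 9B policy")
```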
12:33Juniper: Yeah. So this is one of those results that's hard to know exactly how to weigh, because the comparison isn't apples to apples. Those bigger models weren't trained with this reward; they were trained on different data, different objectives, different everything. But it's a real data point in an ongoing conversation about whether *how* you train matters more than how big you go. Especially for reasoning, which seems to respond particularly well to better supervision signals.
13:06Tyler: And it's not just outcome benchmarks. There's a more telling number, I think, buried in the comparison against vanilla RL.
13:14Juniper: Which one?
13:15Tyler: On rubric-based benchmarks — the ones where reasoning quality matters and the grading isn't just "is the final number right" — running plain RLVR on the same base model actively *hurts* it. The Qwen3.5-9B model gets slightly worse on Frontier Science, on Research QA, on a medical evaluation called LLMEval-Med. The scores drop. Training with pure outcome reward made the model worse at producing high-quality reasoning, by the rubric's lights. MaR, on the same base, improves those scores by around eleven points, two-and-a-half points, and two-and-a-half points respectively.
13:53Juniper: That's a striking inversion. The standard training method makes reasoning quality worse, by the standards of reasoning-quality evaluation. The metacognitive reward fixes it. Which gets at something deeper — pure outcome optimization isn't just neutral on process quality. It can actively erode it. The model learns to get the answer, and one of the things it apparently sheds along the way is the habit of laying out its reasoning cleanly.
14:23Tyler: Which sets up the result that I think is genuinely the heart of the paper. The one that flips the field's default assumption.
14:31Juniper: You're talking about the ablation.
14:34Tyler: I'm talking about the ablation. So they do the standard thing — take their full method and turn off each component, one at a time, and see what happens to performance. Three components: knowledge monitoring, regulation monitoring, correctness. The question is which one is doing the most work.
14:54Juniper: And the assumption, going in — the assumption everyone in the field would have made — is that correctness is doing the most work. Process rewards are nice, sure, but the real signal is whether the model gets the right answer. That's been the working theory behind RLVR's success. Outcome reward is the gold standard; process reward is the supplementary nice-to-have.
15:18Tyler: Right. So you'd expect: knock out correctness, performance falls off a cliff. Knock out the process rewards, performance dips a bit. That's the expected shape.
15:29Juniper: And what they find is the opposite. Removing the knowledge monitoring reward drops science accuracy by one-point-six percent. Removing the regulation reward drops it by two percent. Removing the final-answer correctness reward — the one everyone assumed was doing the heavy lifting — drops it by only one-point-one percent.
15:50Tyler: Removing either process reward hurt more than removing the outcome reward did. That is not what anyone would have predicted.
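The comparison can be laid out explicitly using only the three drops quoted above. The variable names are hypothetical, and the numbers are taken as stated in the episode rather than from the paper's tables.

```python
# Drop in science accuracy when each reward component is removed,
# as quoted in the discussion above.
ablation_drop = {
    "knowledge monitoring": 1.6,
    "regulation": 2.0,
    "correctness": 1.1,
}

# Rank components from largest to smallest drop.
ranked = sorted(ablation_drop.items(), key=lambda item: item[1], reverse=True)

# Check whether each process reward matters more than the outcome reward.
process_components = ("knowledge monitoring", "regulation")
process_beats_outcome = all(
    ablation_drop[name] > ablation_drop["correctness"] for name in process_components
)
```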
15:59Juniper: And to me, that's the moment this paper stops being a clever engineering exercise and starts being a real challenge to a field assumption. If supervising the trajectory at the right level of abstraction does more work than supervising the endpoint — and not just a little more, but consistently more across components — then the hierarchy we've been operating under is wrong. Outcome reward isn't the foundation that process rewards augment. It might be closer to the other way around.
16:31Tyler: I want to be careful about how big a claim that is, though. This is one method, one model family, one set of benchmarks. The ablation result is striking, but I wouldn't yet say the field's hierarchy is inverted. I would say: there's now a concrete data point that pushes against the default assumption, and it's a data point with an intuitive story behind it. Which is more than most ablations give you.
16:57Juniper: Fair. The strong version of the claim isn't established. The weak version — that process reward can do more than people thought, in the right design — looks pretty solid.
17:08Tyler: And it dovetails with another set of numbers that I find quietly more impressive than the headline. The out-of-domain generalization.
17:17Juniper: Yeah. The training data is entirely science and medicine. Roughly thirty-two thousand examples drawn from existing rubric datasets. They never trained on math, never trained on logic, never trained on long-context reasoning tasks. Those are all held out. And then they evaluate.
17:35Tyler: On long-context tasks specifically, knowledge monitoring improves by about seventeen percent over the base model. Regulation fidelity by about eleven. Final correctness by about ten. The model never saw long-context training data, but the metacognitive habit transferred.
17:51Juniper: Which is exactly the prediction the cognitive psychology frame would make. If what you're teaching the model is a general habit — list what you need, plan how you'll proceed, follow your plan — then that habit should apply wherever there's reasoning to be done, regardless of the topic. It's like a writing teacher grading essays on structure rather than content. A teacher grading on "did you have a thesis, did your evidence support it, did your conclusion follow" can grade essays across history, biology, and literature without being a content expert in any of them. The dimensions are general. MaR is trying to do the same thing for reasoning supervision.
18:32Tyler: And on hard math — AIME problems from twenty-twenty-four, twenty-twenty-five, twenty-twenty-six — the model gains around eight percent over its base, even though it never saw math training data. Which is the OOD result that probably matters most for the field, because AIME is the benchmark people actually fight over.
18:52Juniper: Okay, Tyler — I think we've made the strong case. Now I want to hear the steelman against this, because the paper has a pretty significant load-bearing assumption that it doesn't fully stress-test.
19:05Tyler: Yeah. So this is where I want to slow down, because I think the critique here matters more than the usual round of caveats. The reward signal in MaR is, top to bottom, generated by other language models.
19:18Juniper: Walk me through that.
19:19Tyler: Two separate dependencies. First, the gold knowledge units — the canonical list of "what a good answer should reference" for each training problem. Those aren't human-curated. They're generated by GPT-5.1 reading the problem and writing out what it thinks the relevant facts are. So the supervision target — the thing the model is being graded against — is one large model's opinion of what counts as relevant knowledge.
19:48Juniper: And the grader that actually scores the rollouts during training is itself another large model.
19:54Tyler: Right. The default grader in the main experiments is Qwen3.5-397B — a much larger model than the policy being trained. It reads each rollout, identifies which gold knowledge items appeared, scores the alignment between plan and execution, flags shortcuts, checks correctness. All five of the numbers that feed into the reward come from this grader's judgment.
20:18Juniper: Which means the policy is being shaped to produce outputs that this particular grader judges favorably. That's a circular dependency. If the grader has blind spots, those blind spots get baked into the trained model. If the grader is biased toward a certain style of reasoning, the policy learns that style.
20:39Tyler: And the paper itself, almost as an aside, runs a grader-ablation showing that swapping in different graders gives meaningfully different results. Which is fine as a robustness check, but it also confirms exactly the dependency a skeptic would worry about. The method works, but how well depends substantially on whose judgment you're outsourcing the reward to.
21:04Juniper: And there's a deeper version of this concern, which is about the process-level analysis itself. When the paper reports that MaR improves knowledge monitoring on long-context tasks by seventeen percent, the way that seventeen percent gets measured is by an LLM grader judging the same structured format that MaR was trained to produce. So you have to ask: is the model actually doing better metacognition, or is it just getting better at producing outputs that look like good metacognition to a grader of the same family?
21:36Tyler: That's the format-fitting worry. And it's not paranoid — it's a real concern in the literature. There's prior work — the faithfulness papers by Lanham and others, by Paul and others — showing that chain-of-thought traces often don't reflect the actual computation that produced the answer. The model can write one thing and effectively answer based on something else. The shortcut penalty in MaR is trying to push back on this. But the way you detect shortcuts is, again, with an LLM judgment call. So you're using an LLM grader to enforce faithfulness against another LLM's reasoning, and you don't have a clean external check on whether the faithfulness is real.
22:17Juniper: It's the having-the-test-graded-by-another-student problem. The system can work if the grader is reliable, but the whole thing is only as good as the grader's judgment, and there's an obvious worry about the grader having been trained on similar material.
22:32Tyler: Now — I want to be fair to the authors here, because this is a critique of a pattern that's increasingly common across the field, not specifically a critique of their method. Almost everyone doing process supervision is using LLM graders. The alternative is human annotation, which doesn't scale. So this is a structural problem with where the field is, not a specific failing of this paper. But it's the load-bearing assumption, and I think the paper is quieter about it than it should be.
23:03Juniper: The authors do flag two limitations explicitly. They acknowledge that training data is entirely science and medical, so the conclusions about generalization are limited even though the OOD results are encouraging. And they say this is a research framework, not a deployment-ready safety mechanism — MaR-trained models can still produce wrong outputs and they don't claim otherwise. Those are honest. The grader-dependency thing is the one they're less explicit about, but in fairness, almost nobody in this corner of the literature is.
23:37Tyler: One other thing worth naming on the empirical side. The improvements on outcome-only benchmarks — the standard math and coding benchmarks where the answer is just checked — are modest in absolute terms. On GSM8K, the gain is essentially zero. The science average improves by under four points, which is meaningful but not transformative — and the most eye-catching single number, the jump from sixty-seven to seventy-five on GPQA-Diamond, is one benchmark, not the average. The most dramatic-looking gains are on rubric-based benchmarks, where the test format is itself sympathetic to what MaR optimizes for. That's not nothing — process-aware models doing well on process-aware evaluation is meaningful — but it does mean some of the most striking numbers in the paper come from setups that are friendly to the method.
24:29Juniper: All of that said, Tyler, I think you'd still grant that the result is real. The OOD generalization to math and long-context is a genuine signal. The ablation is a genuine challenge to the field's default assumption. The improvements on rubric benchmarks, even with the friendly-evaluation caveat, are large enough to be hard to dismiss.
24:50Tyler: I'd grant that. The critique isn't that the method doesn't work. The critique is that we don't yet know whether it works for the reasons the paper claims it works. Whether the model is actually internalizing metacognitive habits, or whether it's learning to perform the format. Those are very different things, and the experiments don't fully disentangle them. I think that's the right place for a skeptic to plant the flag.
25:17Juniper: So where does that leave us? Let me try to pull it together. The paper takes a forty-seven-year-old idea from cognitive psychology, Flavell's distinction between metacognitive knowledge and metacognitive regulation, and operationalizes it as a reward signal for RL training. The reward has three pieces: did you list the relevant knowledge, did you follow your plan, and did you get the right answer. The components are summed, plugged into a standard policy optimization loop, and used to train a nine-billion-parameter model on science and medical data.
25:54Tyler: And the results, taken at face value, are striking. A nine-billion model competitive with a hundred-and-twenty-billion frontier model roughly thirteen times its size on reasoning benchmarks. Generalization to math and long-context tasks the model never trained on. An ablation showing that, contrary to the field's working assumption, the process rewards are pulling more weight than the outcome reward.
26:19Juniper: And the caveat is that the entire reward signal comes from other language models — both the gold knowledge that defines the target and the grader that scores the rollouts. Which means the contribution is real, but the strength of the claim depends on how much you trust LLMs to grade other LLMs on reasoning quality. Which is a field-level question, not a paper-level one.
26:43Tyler: Here's what I think is the most interesting frame, though. Whether or not MaR ends up being the right specific implementation, the paper is making a point about reward design that I think is going to outlive the specific method. The point is: outcome reward and rubrics-as-rewards aren't the only two options. There's a third axis — domain-general process supervision — and you can find dimensions on that axis by looking outside machine learning, in this case at cognitive psychology, for vocabulary the field hasn't been using.
27:18Juniper: And that cross-disciplinary move is rarer than it should be. Most of the time when ML borrows from cognitive science, the borrowing is loose, metaphorical. Here it's operationalized. Flavell's knowledge-and-regulation distinction shows up not as inspiration but as the actual structure of the reward function. That's a real bet that the conceptual scaffolding will carry weight, and at least in this paper, it seems to.
27:46Tyler: Whether the next paper in this line tightens the grader-dependency concern, or finds a different cognitive-science construct that's even better suited, or shows the format-fitting worry was overblown — those are the open questions. But the move itself feels right to me. There's a lot of vocabulary outside the field that hasn't been brought into reward design, and "what makes reasoning good?" is exactly the kind of question that other disciplines have spent decades on.
28:17Juniper: A nice place to land. The show notes have the link to the paper and a couple of related reads if you want to pull on the metacognition thread or the process reward thread further.
28:29Tyler: And if you want the full transcript with definitions baked in, plus the cross-links to other episodes that touch these ideas, that's all on paperdive.ai.
28:39Juniper: Thanks for listening to AI Papers: A Deep Dive.