Episode 197 · Jul 03, 2026 · 17 min

Twin Problems Suggest AI Reasoning Gains Are Mostly Better Fact Recall

Abdaljalil, Serpedin, Kurban

NLP

Concepts in this episode

Evaluation & Benchmarks, Training Methods, Chain of Thought, Test-Time Compute, Ablation Studies, LLM-as-Judge, Synthetic Data, Reproducibility, GPQA, Knowledge vs. Reasoning, Benchmark Contamination

About this episode

Paper

IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

Venue

arXiv:2607.01431

Year

2026

Read the paper

arxiv.org/abs/2607.01431

OpenAI's reasoning model beats its ordinary sibling by nineteen points on one science benchmark — and loses by twenty-five on another covering the same sciences. A new paper from Texas A&M explains the reversal with a simple counting trick: build twin problems with identical logic but zero shared facts, and watch whether reasoning gains travel between them. More than nine in ten don't — suggesting the industry's expensive 'reasoning premium' may mostly be buying a longer sweep of the model's memory, not better logic.

What you'll take away

Why every science benchmark fuses two separate skills — knowing facts and executing procedure — making a gain in fact-fishing indistinguishable from a gain in logic
The twin-problem trick: 144 problem pairs with identical solution steps but zero shared knowledge, letting you test whether an improvement travels with the logic or stays with the facts
Across five model pairs, 63 of 69 reasoning-mode gains were one-sided — over nine in ten stayed with the facts, though the authors flag this as a ceiling, not an exact figure
The cleanest experiment in the paper: toggling reasoning on the same model was a statistical wash (helped 8 items, hurt 9 on Gemini 2.0 Flash), suggesting visible extended thinking bought nothing on short procedural problems
Where the paper is soft: a contamination asymmetry between seen and fresh twins, 23% of API calls excluded due to token-cap truncation, and a mostly multiple-choice format all make the dramatic 25-point reversal attackable
What this doesn't cover — twenty-step derivations and open-ended problems, where the GPQA result hints extended reasoning may still earn its keep

Chapters

00:01 One model, two tests, opposite verdicts
02:03 Why benchmarks can't see what improved
02:53 The chicken-and-mushroom trick behind IsoSci
06:08 How to catch a cheat sheet
09:07 Sixty-three to six
09:54 Same model, reasoning on: coin flip
11:03 Sprinter versus marathoner: the paradox dissolves
12:31 The loudest number is the softest

References in this episode

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — The paper that launched the 'thinking longer means smarter' narrative this episo
GPQA: A Graduate-Level Google-Proof Q&A Benchmark — The benchmark where o3-mini wins by nineteen points in the episode's cold open,
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models — Apple's earlier use of the same controlled-contrast trick — swapping surface det
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity — A complementary skeptical result on reasoning-mode models that speaks directly t

Full transcript

0:00Bella: For the last three years, the story of AI progress has run on one idea: models that think longer are smarter. Chain-of-thought, reasoning-mode training, extra compute at answer time. The benchmark scores went up, and the narrative wrote itself. This week, a paper out of Texas A&M put a crack straight through it. Take o3-mini, OpenAI's reasoning-specialized model, and GPT-4o-mini, its ordinary sibling. On GPQA Diamond, a graduate-level science exam, the reasoning model wins by about nineteen points. On a new benchmark called IsoSci, covering the same sciences, it loses by about twenty-five. Same two models. A forty-four point swing, depending purely on which test you administer.

0:47Finn: Quick tag before we go on: this is an AI-made explainer, both voices included. So one of those benchmarks has to be lying, right? Nineteen up or twenty-five down. They can't both be measuring "good at science."

1:02Bella: Neither one is lying, and that's the uncomfortable part. By the end of this you'll be able to decode that reversal with a single counting trick. And that same trick says more than nine in ten of the gains you get from flipping on reasoning mode come from surfacing facts, with only a sliver left over for anything you'd call better logic. This matters well beyond one leaderboard, because reasoning is currently the industry's central pricing lever. Reasoning models cost more per token, burn more compute, and are sold as a different class of intelligence.

1:40Finn: Let me make the case for that pitch, because it's not stupid. A reasoning model writes out its steps. First find the formula, then substitute, then compute. Steps look like logic. So when you train a model to deliberate and its science scores jump, the natural read is that the logic improved. That's the story everyone, including me, has been telling.

2:04Bella: And the problem with that story is buried in the questions themselves. Every science problem demands two separate things at once. You have to know something — the ideal gas law, an acid constant — and you have to do something with it: recall, substitute, compute. Standard benchmarks fuse those into one score. When the score goes up, you can't see which ingredient moved. The knowing and the doing are tangled in every single item, so a gain in fact-fishing is indistinguishable from a gain in logic. Untangling them takes a purpose-built instrument, and building it is the whole paper.

2:43Finn: Which sounds impossible on its face. How do you test the doing without the knowing? Every question about gas pressure requires knowing about gas.

2:53Bella: You can't strip the knowledge out, Finn. So instead you hold the logic constant and swap the knowledge. It's the oldest move in experimental science: change one variable, freeze everything else. Think of two recipes with an identical procedure — sear, deglaze, reduce, plate — but zero shared ingredients. One is chicken in white wine, the other is mushrooms in stock. A cook who mastered the technique executes both. A cook who memorized the chicken dish flails at the mushrooms. IsoSci is a benchmark of one hundred forty-four twin problems built exactly like that. Here's the pair on screen now, and it's the running example for everything today. Left side: a two-mole sample of ideal gas at three hundred kelvin fills forty-nine liters, find the pressure. Right side: a weak acid at a given concentration and dissociation constant, find the pH.

3:51Finn: And those look like completely different problems.

3:54Bella: They are, at the level of facts. Watch the procedure column light up between them, though: recall a formula, plug in the given values, compute. Identical, step for step. But the formulas themselves share nothing. Knowing the gas law tells you nothing about the pH approximation. Same recipe, disjoint ingredients. That's what "isomorphic" means here: same shape. The benchmark holds five of these short procedural shapes, all three-to-five-step problems, spanning physics, chemistry, biology, and earth science.
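
Note: the twin pair described here can be sketched in a few lines of Python to make the shared procedure visible. The gas numbers are the ones quoted in the episode. The acid's concentration and dissociation constant are not stated, so the values below (0.10 M, Ka = 1.8e-5) are placeholders chosen only for illustration, and both function names are hypothetical.

```python
import math

R_L_ATM = 0.082057  # gas constant in L·atm/(mol·K)

def gas_pressure_atm(n_mol, temp_k, volume_l):
    # Step 1: recall the ideal gas law, PV = nRT.
    # Steps 2-3: substitute the given values and compute P.
    return n_mol * R_L_ATM * temp_k / volume_l

def weak_acid_ph(conc_m, ka):
    # Step 1: recall the weak-acid approximation, [H+] ≈ sqrt(Ka * C),
    # which assumes only a small fraction of the acid dissociates.
    # Steps 2-3: substitute the given values and compute pH.
    return -math.log10(math.sqrt(ka * conc_m))

# Twin A: values from the episode.
pressure = gas_pressure_atm(2.0, 300.0, 49.0)

# Twin B: assumed values, not taken from the paper.
ph = weak_acid_ph(0.10, 1.8e-5)

print(f"pressure = {pressure:.2f} atm, pH = {ph:.2f}")
```

Both functions follow the same recall, substitute, compute shape, yet neither formula helps with the other problem, which is the property the benchmark relies on.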

4:29Finn: How do you manufacture a hundred and forty-four of these? Someone has to write the mushroom version of every chicken problem.

4:37Bella: An LLM does. They seeded from existing benchmarks, had Claude generate the twin in a different domain, then filtered hard: a three-model judge panel checking logical equivalence and knowledge disjointness, plus a human expert audit that found the automated filter agreed with human consensus about ninety percent of the time. Only about a third of candidates survived. And the entire operation, construction plus eleven thousand evaluation calls, cost under a thousand dollars in API fees. A shoestring experiment aimed at an expensive narrative.

5:14Finn: Hold on, though. Claude generated the problems and sat on the jury approving them. And then the paper uses those problems to indict LLM reasoning. That's the fox auditing the henhouse.

5:27Bella: The authors saw that coming. They excluded every Anthropic model from the evaluation itself, and they re-ran acceptance with Claude removed from the judge panel: exactly one pair out of a hundred forty-four would have changed. That worry is handled.

5:43Finn: Fine. But here's one that isn't, and I want it on the record early. Each twin pair has a lopsided birth: one problem comes from a public benchmark that models plausibly saw during training, and the other is freshly written, definitionally unseen. That asymmetry sits inside every measurement they make. Park it. I'm coming back to it.

6:06Bella: Noted, and it will cost me later. So the twins exist. The counting logic is next. The paper's entire empirical core is one tally, and it pays off in a single ratio: sixty-three to six. Three things to track before we count. First, the twins: two problems, same procedure, zero shared facts. Second, a gain: any item that reasoning mode flips from wrong to right. Third, two kinds of model pairs: traditional pairs, meaning a reasoning-trained model against a standard sibling like o3-mini versus GPT-4o-mini, and toggle pairs, the same model with the provider's reasoning flag switched on or off. Same weights, same everything.

6:49Finn: The toggle being the same student with and without scratch paper, while the traditional pair is two different students, one of whom went to a fancier school.

7:00Bella: Exactly the right split, and it decides which claims are causal. Now the counting game. Imagine a student takes twin exams — every question on exam A has a partner on exam B with the same solution steps but different facts. You give the student some intervention, and their score improves. Here's the diagnostic: if the intervention sharpened their general problem-solving, the improvement should appear on both twins, because logic is the only thing the twins share. If it appears on one twin only, the intervention helped with that twin's facts. It was a cheat sheet, not a skill. That's the paper's metric: for every gain reasoning mode produces, check whether it travels to the structurally identical twin.

7:49Finn: Wait — why does a one-sided gain have to be knowledge? Maybe the model just reasoned harder on that one problem and never got around to the twin.

8:00Bella: There's no per-problem luck to appeal to here — everything was sampled exactly once at temperature zero, so the model can't "reason harder" on one run than another, and the same lopsided pattern showed up across five model pairs from four different families. The design forces the inference: the twins share only structure, they differ only in facts, so a gain that stays home traveled with the facts. There is one honest wrinkle, and the authors flag it themselves. If a real logic improvement rescues only the harder twin, because the easier twin was already solved, it gets miscounted as a knowledge win. So whatever number comes out is a ceiling on knowledge-dependence, and I'll say it that way from here on.

8:48Finn: So the whole method in one line: how do you tell a knowledge win from a reasoning win?

8:54Bella: If the gain shows up on both twins, the logic got better; if it shows up on one, the facts did.
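
Note: the counting rule can be written down as a short tally. This is a minimal sketch under stated assumptions, not the authors' code: it assumes one record per twin pair holding four booleans (each twin, with reasoning off and on), and the field names, the function name `classify_gains`, and the toy records are all hypothetical.

```python
from collections import Counter

def classify_gains(pairs):
    """Tally gains that appear on both twins versus on one twin only."""
    tally = Counter()
    for p in pairs:
        # A gain is an item that flips from wrong to right with reasoning on.
        gain_a = (not p["a_base"]) and p["a_reason"]
        gain_b = (not p["b_base"]) and p["b_reason"]
        if gain_a and gain_b:
            tally["transferred"] += 1   # gain traveled with the shared logic
        elif gain_a or gain_b:
            tally["one_sided"] += 1     # gain stayed with one twin's facts
    return tally

# Toy records for illustration only; these are not data from the paper.
pairs = [
    {"a_base": False, "a_reason": True, "b_base": False, "b_reason": True},
    {"a_base": False, "a_reason": True, "b_base": False, "b_reason": False},
    # Twin B was already solved, so a real logic gain on A would still be
    # counted as one-sided. This is why the resulting share is a ceiling.
    {"a_base": False, "a_reason": True, "b_base": True, "b_reason": True},
    {"a_base": True, "a_reason": True, "b_base": False, "b_reason": False},
]

tally = classify_gains(pairs)
total_gains = tally["transferred"] + tally["one_sided"]
knowledge_ceiling = tally["one_sided"] / total_gains
print(tally, round(knowledge_ceiling, 3))
```

The third toy record shows the wrinkle the authors flag: a pair where one twin was already correct can never register as a transferred gain.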

9:00Finn: Then run it. If the industry narrative is right, gains should transfer constantly. Both twins, over and over.

9:08Bella: Across five model pairs from four families, they found sixty-nine gains total. Sixty-three of them were one-sided. More than nine in ten of the improvements from reasoning mode stayed with the facts and never traveled with the logic. Not one of the five model pairs dipped below roughly seventy-eight percent knowledge-dependent. And remember, that ninety-one percent figure is the ceiling reading, but you'd need it to be off by a mile before the picture changes.

9:40Finn: That's the traditional pairs, though. Different students, different schools. o3-mini and GPT-4o-mini differ in their whole training, so you can't pin anything on the reasoning itself. What did the scratch paper say?

9:54Bella: This is the beat I'd rewind for, so let it land. Toggle pairs: identical model, identical weights, reasoning flag on versus off. Any difference isolates the visible extended thinking itself. Gemini 2.0 Flash, reasoning on... helped on eight items. And hurt on nine.

10:12Finn: Eight and nine. It's a coin flip.

10:15Bella: Statistically indistinguishable from zero, and the other toggle model, Qwen3, came out twenty-one against twenty. On these short procedural problems, flipping the reasoning switch on the very same model bought nothing systematic at all. So here's the paper's interpretation, and it's the image I want you to keep. You've lost your keys, and you pace the apartment narrating aloud — I came in, I put down the groceries, I answered the phone — until the memory surfaces. The talking gave your memory more chances to fire. On these tasks, extended reasoning works like that pacing: more tokens means more chances for the right formula to surface mid-generation. What happens after the fact surfaces doesn't improve.
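
Note: one standard way to check whether a helped-versus-hurt split differs from a coin flip is an exact two-sided sign test. The episode does not say which test the paper used, so the sketch below is an assumption, and `sign_test_p` is a hypothetical helper name.

```python
from math import comb

def sign_test_p(helped, hurt):
    """Exact two-sided sign test against a fair coin (p = 0.5)."""
    n = helped + hurt
    k = min(helped, hurt)
    # Probability of a split at least as lopsided as k on the smaller side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail and cap the result at 1.
    return min(1.0, 2 * tail)

p_gemini = sign_test_p(8, 9)    # Gemini 2.0 Flash: helped 8, hurt 9
p_qwen = sign_test_p(21, 20)    # Qwen3: 21 against 20

print(p_gemini, p_qwen)
```

Both splits are the most balanced outcomes possible for their totals, so the test gives no evidence that the reasoning toggle helped or hurt.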

11:03Finn: Which finally decodes the cold open. If reasoning mode is mostly a longer memory sweep, then o3-mini's nineteen-point win on GPQA and its twenty-five-point loss on IsoSci stop being a paradox.

11:16Bella: Right. It's asking whether a sprinter or a marathoner is the better runner. Stage the hundred meters, the sprinter wins by a mile; stage the marathon, the opposite. Neither result is wrong; the question was never well-formed. GPQA is graduate-level material heavy on conceptual depth and multi-step derivation, exactly the terrain the paper says favors reasoning-specialized models. IsoSci is short, procedural, and knowledge-matched by construction. On it, the plain GPT-4o-mini out-answered its reasoning sibling on eighty-seven items, while the reverse happened only sixteen times. Five to one. The sting, and the reason this is more than a cute sports analogy: with runners, we know in advance what each race measures. Nobody knew GPQA and IsoSci were different races. The whole field was scoring them as the same event.

12:12Finn: Okay. Checkpoint, because we've made three moves: benchmarks fuse knowing and doing, twin problems un-fuse them, and once un-fused, reasoning gains almost never travel with the logic. Now, Bella, the reservation I parked. Because I think the loudest number in this paper is also its softest.

12:31Bella: Go.

12:31Finn: Three cuts. First, the twins' lopsided birth. Source problems come from public benchmarks the models may have eaten during training; target twins are fresh and unseen. In the pooled data, source-side gains outnumber fresh-side gains forty-one to twenty-two, nearly two to one. The paper cautiously calls that a difficulty effect. A skeptic calls it a take-home exam where one twin's questions were circulating in the study group. Memorization of public benchmarks is a documented, boring, ever-present confound, and this design can't fully exclude it. Second, the reversal itself. Twenty-three percent of all API calls got excluded, concentrated in reasoning modes slamming into the output-token cap. On one seed benchmark, twenty-eight percent of reasoning-mode responses cut off versus eleven for standard. Systematically truncating the chattier model's answers could manufacture part of that dramatic twenty-five-point loss. Third, format. Over seventy percent of IsoSci is four-way multiple choice. So "IsoSci reveals o3-mini's weakness" and "IsoSci's format rewards GPT-4o-mini's strengths" are two captions on the same data.

13:45Bella: On the reversal, Finn, you win, and I'll say it plainly: the size of that twenty-five-point number is attackable, and the truncation issue means it deserved more scrutiny than the paper gave it. What I'd defend is what doesn't route through the reversal at all. The transfer result and the toggle wash stand on their own, and the toggle comparisons are the same model against itself, so contamination hits both configurations equally.

14:15Finn: Mostly agreed. The direction survives me; the decimal doesn't. Sixty-nine total gain events, some model pairs contributing eight pairs each, and a metric the authors themselves call an upper bound. Quote the paper as "reasoning gains on short procedural problems are overwhelmingly knowledge-dependent." Don't quote it as "ninety-one point three percent," because that precision is borrowed, not earned. And one scope line the authors are honest about: the toggle only suppresses visible chain-of-thought. The model might still deliberate silently. So the clean causal claim is about visible extended reasoning, on three-to-five-step problems, full stop.
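
Note: the point about borrowed precision can be illustrated with a confidence interval on 63 of 69. The Wilson score interval below is an illustration added here, not an analysis from the paper. It also treats the 69 gain events as independent, which is an assumption that pooling across five model pairs may not satisfy.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 63 one-sided gains out of 69 total gains, as quoted in the episode.
low, high = wilson_interval(63, 69)
print(f"point estimate {63 / 69:.3f}, interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 0.82 to 0.96, which supports quoting the direction of the result rather than its decimal.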

14:57Bella: And that scope cuts both ways, which is the fair note to end the science on. This paper says nothing about twenty-step derivations, open-ended research problems, or hypothesis generation. And the GPQA result hints extended reasoning might earn its keep there. What it does cover is a huge slice of real usage: homework help, engineering sanity checks, lab computations. For that slice, the reasoning premium buys you a longer sweep of the model's memory. Useful, sometimes decisive, but a different product than the one on the label. And the method outlives the finding. Building benchmarks as controlled contrasts instead of piles of hard questions is a template any lab can copy, for any capability, at a few hundred dollars a run.

15:45Finn: So re-read the cold open with what you know now. Nineteen up on one test, twenty-five down on another only looks like a contradiction until you see it as two races measuring two blends of the same two ingredients. "Which model is better at science" was never a well-formed question, and the tests we trusted to answer it never separated remembering from thinking until someone built twins.

16:10Bella: So pick your camp. Is extended reasoning a real second capability that three-to-five-step problems are simply too short to reveal? Or is it retrieval all the way up, and the long-horizon gains will decompose exactly the same way once someone builds twins for them? Tell us which side you'd bet on, and what evidence would flip you. The full annotated version of this episode is at paperdive.ai — every technical term tap-to-define, with related papers linked by theme.

16:44Finn: Housekeeping, quickly: this script was written by Anthropic's Claude Fable 5; Bella and I are AI voices from Eleven Labs, and the producer is affiliated with neither company. The paper is IsoSci, from Texas A&M and Hamad Bin Khalifa University, published July first, 2026. We recorded this July third.

17:05Bella: The thing to watch for now: the first twin-pair benchmark built on long derivations. If reasoning gains transfer across those twins, the premium is real. If they don't, the industry just paid three years of markup for a longer walk around the apartment.