Episode 191 · Jul 02, 2026 · 26 min

How One Researcher Beat GPT-5.2 and Gemini 3 by Judging Their Answers, Not Improving Them

Land

Test-time Compute

AI Papers: A Deep Dive

Concepts in this episode

Test-Time Compute · Evaluation & Benchmarks · AI Agents · Parallel Sampling · LLM-as-Judge · Multimodal Models · Chain of Thought · Agent Benchmarks · Inference-Time Scaffolding · Strategy Diversity · Pass@k Metric · Credit Assignment

About this episode

Paper: Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

Venue: arXiv:2606.31543

Year: 2026

Read the paper: arxiv.org/abs/2606.31543

A solo researcher outscored the flagship configs of GPT-5.2 Pro and Gemini 3 Pro on the hardest reasoning benchmark by more than eighteen points — using those exact models, without training anything smarter. The trick: on genuinely hard puzzles the popular answer is almost always the trap, so the whole game is selection, not generation. You'll come away with a concrete rethink of what test-time compute should actually buy you.

What you'll take away

Why majority voting fails hardest exactly on the puzzles that matter: the crowd converges on the same tempting wrong assumption, so more votes bury the lone correct answer
How treating problem modality (text, image, code) as the axis of diversity beats simply sampling one model hot many times — and why the renders are deliberately blurred
What 'holistic judging' does: reading all candidates' full reasoning traces side by side recovered 7 minority answers for only 13% of total system cost — the cheap phase does the decisive work
The counterintuitive prompting finding: every attempt to structure or template the reasoning made it worse, a 'compliance tax' that collapses the diversity the system depends on
Where the evidence is soft — the '+7 from judging' comes from re-scoring one run, not a head-to-head, and the component attributions are educated inference, not proof
A striking infrastructure reality: 84% of the GPT-5.2 API calls failed, roughly doubling the cost through wasted retries

Chapters

00:31 Why the popular answer is a trap
01:58 The spaceship puzzle nothing could solve
04:00 A whodunit where the crowd is the red herring
06:03 Three specialists, one puzzle, blurry images
08:45 The jury that reads every argument
13:15 When zero candidates got it right
14:47 Why structuring the prompt made it worse
17:59 How much of the story actually holds up
22:57 What test-time compute should really buy

References in this episode

On the Measure of Intelligence — François Chollet's paper introducing the ARC benchmark and its founding thesis —
Self-Consistency Improves Chain of Thought Reasoning in Language Models — The canonical majority-vote-over-samples method that this episode argues fails p
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Foundational work on using models as evaluators, relevant to the episode's holis
Large Language Models are Zero-Shot Reasoners — The 'Let's think step by step' result on minimal prompting, a useful counterpoin

Full transcript

0:00Cassidy: One independent researcher, working solo with an API budget, beat the flagship configurations of GPT-5.2 Pro and Gemini 3 Pro on the hardest AI reasoning benchmark going — by more than eighteen points. And the part that should bother you: he did it by calling those exact models. He didn't train anything smarter. He built a better way to decide which of their answers to trust.

0:26Eric: Quick heads up before we start — this is an AI-made explainer, both voices included.

0:32Cassidy: The benchmark is ARC-AGI-2, and the score is roughly seventy-three percent, against about fifty-four for the best single models. What you'll understand by the end is why that gap exists — why the whole game on these puzzles turns out to be selection, not generation. The models already produce a right answer more often than you'd think. They just can't tell it's the right one.

0:58Eric: And that's the counterintuitive core, right? Because the obvious move when a model is unsure is to sample it a bunch of times and take the answer it lands on most. Majority vote. That's the standard trick, and it usually works.

1:15Cassidy: It usually works. On these tasks it fails — and it fails hardest exactly where you need it most. The claim this paper defends is that on a genuinely hard reasoning problem, the popular answer is almost always the wrong one. Which sounds backwards until you see why.

1:33Eric: And why this matters beyond one leaderboard: if it holds, it changes what "spend more compute at answer-time" should buy you. Not more votes — more diverse guesses, plus a small, smart investment in picking. That's a lever anyone can pull today, without a training run.

1:52Cassidy: So let me show you why the popular answer is a trap. And the cleanest way in is one specific puzzle — the one that beat everything. ARC-AGI-2 is a grid puzzle benchmark. You're shown three or four examples of a little grid of colored cells turning into another grid, and you have to infer the transformation rule, then apply it to a fresh input. The whole benchmark is engineered to be unmemorizable — every puzzle uses a novel rule, so you can't win by having seen a billion similar problems in training. You have to actually learn a new pattern from a handful of examples, on the spot. That's the founding idea behind ARC: measure how fast you pick up a brand-new skill, not how much you've already absorbed.

2:40Eric: Which is precisely the thing frontier models are worst at. They're spectacular at recall. This punishes recall.

2:47Cassidy: Right. And here's the puzzle that anchors the whole paper. A human looks at it and sees spaceships — little colored shapes on a grid, each with an exhaust side and a nose. Inside each ship, on the exhaust end, there are some colored particles. The rule: move the particles to the nose, and extend the ship forward in its direction of travel. A person gets it in seconds.

3:12Eric: And to infer that from three examples, the model has to chain something like four separate inferences — identify the ship as an object, figure out inside versus outside, work out which way it's pointing, and classify the edges by length. Miss any one link and the whole thing collapses.

3:32Cassidy: All twenty-nine candidate solutions the system generated failed it. GPT-5.2, Gemini 3, Opus 4.5 — none of them got the spaceship. It's still unsolved. And that's the texture of what "hard" means here: not arithmetic, not obscure knowledge, just a chain of ordinary visual inferences that humans do without noticing and models can't hold together.

3:55Eric: So now walk me to the minority-report thing, because this is the move the entire architecture is built on.

4:03Cassidy: Here's the logic. These puzzles are underspecified — multiple different rules are all consistent with the few examples you're shown, but only one generalizes to the test case. So the model isn't purely reasoning, it's guessing which rule the puzzle-maker intended. Now — if the obvious, most-common guess were the right one, frontier models would already solve the task, and it wouldn't be on the "hard" pile. So the tasks that stay unsolved are exactly the ones where the correct answer is a minority opinion. A reading only one candidate out of thirty stumbled onto.

4:40Eric: It's a whodunit. Three clues, several suspects, all consistent. If everyone in the room instantly agrees on one suspect — in a well-built mystery, that's the red herring the story wants you to suspect. The real culprit is the reading one careful detective arrives at.

4:58Cassidy: That's the exact shape. And it tells you why majority voting is doomed on these. Voting rewards the crowd. But on a hard puzzle the crowd all made the same tempting simplifying assumption. The errors don't scatter — they pile up on the same wrong answer. So the more you vote, the more confidently you converge on the trap.

5:19Eric: Which means the fix has to do two opposite-sounding things at once. Generate candidates diverse enough that the rare correct hypothesis shows up at all — and then select in a way that won't crush that lone correct voice under the majority.

5:35Cassidy: Two phases. Diverse generation, then holistic judging. And I'll flag one thing now that we'll come back to hard at the end — the headline evidence for how much the judging phase adds comes from re-scoring a single run after the fact, not from running two full pipelines head to head. Hold that thought. It matters.

5:55Eric: Noted. Let's build the first phase — because the diversity trick here is not what I expected.

6:01Cassidy: The obvious way to get diverse guesses is to crank up the randomness — sample the same model hot, many times. Land does something different. He treats the modality — the form the problem is presented in — as the axis of diversity. Same puzzle, three representations. Text: the model reasons about the grid in prose. Image: the grid is rendered as a picture and the model reasons visually. And code: the model writes an actual executable program that maps input grids to output grids, and you run it.
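A minimal sketch of what the code channel amounts to in practice. It assumes a grid is a list of rows of color integers, and it assumes a candidate program is checked against the training pairs before being applied to the test input; the episode only says the program is run, so that check is an illustration. The names `mirror_horizontal` and `fits_training_pairs` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]  # rows of color indices; this representation is an assumption


def mirror_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate program: flip each row left to right."""
    return [list(reversed(row)) for row in grid]


def fits_training_pairs(program: Callable[[Grid], Grid],
                        pairs: List[Tuple[Grid, Grid]]) -> bool:
    """True if the program reproduces every training output exactly."""
    return all(program(inp) == out for inp, out in pairs)


# Toy training pairs and test input, invented for this example.
train_pairs = [
    ([[1, 0, 0], [2, 0, 0]], [[0, 0, 1], [0, 0, 2]]),
    ([[0, 3], [4, 0]], [[3, 0], [0, 4]]),
]
test_input = [[5, 0, 0, 0], [0, 6, 0, 0]]

candidate_output: Optional[Grid] = None
if fits_training_pairs(mirror_horizontal, train_pairs):
    candidate_output = mirror_horizontal(test_input)
```

Fitting the training pairs does not make a program right on the test case. Several rules can fit the same few examples, which is why the candidate still goes to the selection phase.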

6:34Eric: Three specialists on the same problem. A novelist who reasons in prose, a photographer who reasons in shapes, an engineer who reasons in procedures. Hand them the same puzzle and they genuinely notice different things.

6:49Cassidy: And that's measurable, not just a metaphor. Some tasks only ever get solved in pixel-space — a zoom or a mirror operation the vision channel just sees, that the text channel would grind through cell by cell and botch. Others only get solved in code — walking a boundary structure that's natural to write as a loop. Across three foundation models in various settings, he generates up to twenty-nine candidates, each producing an output grid plus its full reasoning trace — the whole written chain of how it got there.

7:23Eric: There's a detail in here I love — the images are deliberately blurry. Not pixel-perfect. Why sabotage your own render?

7:31Cassidy: Because pixel-perfect renders backfired. When the image was crisp, the model treated it like a lossless spreadsheet and fell right back into cell-by-cell numerical reasoning — the exact thing you rendered a picture to escape. Blurring it forces the model to see shapes instead of reading cells. You degrade the input on purpose to get the high-level pattern recognition that made vision worth using.

7:58Eric: So the whole first phase is deliberately profligate — throw money at wildly different attempts so the rare correct one appears somewhere in the pile. There's a cheap early-stopping probe too, right? If a first look already strongly agrees, it quits and skips the expensive modalities.

8:17Cassidy: On easy tasks, yes — save the money where the answer's obvious. Keep that early-stopping in mind, though, because it has a dark side we'll hit later. So: phase one hands us a pile of up to twenty-nine candidates, each with a full reasoning trace, and somewhere in there — hopefully — sits the correct minority answer. Now the real problem. How do you find it?
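The early-stopping probe can be pictured as an agreement check over a small first batch of outputs. This is a sketch under assumptions: the `min_agreement` threshold, the function name, and the tuple-of-tuples grid representation are all illustrative, not details from the paper.

```python
from collections import Counter
from typing import List, Optional, Tuple

Grid = Tuple[Tuple[int, ...], ...]  # hashable grid so identical outputs can be counted


def early_stop_answer(probe_outputs: List[Grid],
                      min_agreement: float = 0.9) -> Optional[Grid]:
    """Return the consensus grid if the probe agrees strongly, else None.

    min_agreement is an illustrative threshold, not a value from the paper.
    """
    if not probe_outputs:
        return None
    grid, count = Counter(probe_outputs).most_common(1)[0]
    return grid if count / len(probe_outputs) >= min_agreement else None


a: Grid = ((1, 1), (0, 0))
b: Grid = ((0, 0), (1, 1))

easy_task = early_stop_answer([a, a, a, a])  # unanimous, so stop and skip the rest
hard_task = early_stop_answer([a, b, a, b])  # split, so run the full pipeline
```

A check like this only sees agreement. It cannot tell a correct consensus from a shared wrong assumption.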

8:41Eric: This is the part I'd actually want to get right, so let's take it slowly. The selection phase is where the paper lives or dies — and it's where they tried two things that failed before the one that worked. The payoff at the end of this is a single number that I think is the most memorable stat in the paper: the judging costs almost nothing and does almost all the decisive work.

9:07Cassidy: Start with what doesn't work. Land tried a "logic judge" — score each trace on whether its reasoning is internally consistent, sounds sound. Failed. A candidate can be flawlessly logical and still be a brittle overfit to the training examples — perfect reasoning toward the wrong rule. Then a "consistency judge" that rewards answers whose themes repeat across candidates. That one failed for the obvious reason —

9:34Eric: — it just re-elects the majority. It rewards the crowd, which is the thing you're trying to beat. You've rebuilt voting with extra steps.

9:43Cassidy: Exactly. The winner is what he calls holistic judging, and it's the anti-consistency mechanism. Instead of scoring each candidate alone, or counting votes, you dump all the candidates' full reasoning traces — thirty to eighty thousand tokens of actual argument — into one long prompt, and ask a judge model to read them side by side, compare them jointly, pick the two most likely correct, and explain why the rest are wrong.
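A sketch of the prompt-assembly step described here, where every candidate's full trace goes into one prompt for joint comparison. The `Candidate` fields, the label scheme, and the instruction wording are hypothetical; the paper's actual prompt is not quoted in the episode.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    label: str  # e.g. "text-01"; the naming scheme is hypothetical
    trace: str  # full reasoning trace, uncompressed
    grid: str   # output grid serialized as text


def build_judge_prompt(candidates: List[Candidate]) -> str:
    """Put every candidate's full trace and answer into one prompt."""
    sections = [
        f"### Candidate {c.label}\nReasoning:\n{c.trace}\nOutput grid:\n{c.grid}"
        for c in candidates
    ]
    # Instruction wording is illustrative, not the paper's prompt.
    instructions = (
        "Read all candidates side by side. Pick the two most likely to be "
        "correct and explain why the others are wrong."
    )
    return "\n\n".join(sections + [instructions])


pool = [
    Candidate("text-01", "Objects shift right by one cell.", "0 1\n0 2"),
    Candidate("image-01", "The shape is mirrored left to right.", "1 0\n2 0"),
]
prompt = build_judge_prompt(pool)
```

The point of the design is what is absent: no per-candidate score and no vote count, only the arguments themselves.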

10:12Eric: Voting is a show of hands with no arguments — you just count who raised a hand for what. This is a jury that reads every juror's full written reasoning next to each other, and reasons about which argument actually holds up, even if only one person made it. A jury can be persuaded by a lone, well-supported dissent that a headcount would have buried.

10:36Cassidy: And the load-bearing claim is subtle: keeping the full reasoning together beats compressing each trace into a score. The moment you reduce a candidate to a number, you lose the fine distinctions between near-identical hypotheses — the difference between real insight and confident groupthink lives in the details of the argument, and a score throws those away. Three parallel judge instances each name a first and second pick; first pick gets two points, second gets one, you sum across the three, and the top two distinct grids become the final answers. No learned weights, no optimization. Just — read everything, then choose.
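The scoring rule just described (three judges, two points for a first pick, one for a second, top two distinct grids win) fits in a few lines. The tie-break by first appearance is an assumption, since the episode does not say how ties are resolved, and the toy grids are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Grid = Tuple[Tuple[int, ...], ...]


def aggregate_judges(picks: List[Tuple[Grid, Grid]]) -> List[Grid]:
    """Each judge gives (first, second); first earns 2 points, second earns 1.

    Returns the two highest-scoring distinct grids. Ties fall back to the
    order in which grids were first named, which is an assumption.
    """
    scores: Dict[Grid, int] = defaultdict(int)
    for first, second in picks:
        scores[first] += 2
        scores[second] += 1
    # sorted() is stable, so equal scores keep first-appearance order.
    ranked = sorted(scores, key=lambda g: scores[g], reverse=True)
    return ranked[:2]


a: Grid = ((1, 0), (0, 1))
b: Grid = ((0, 1), (1, 0))
c: Grid = ((1, 1), (0, 0))

# Three judge instances, each naming a first and second pick.
final_answers = aggregate_judges([(a, b), (c, a), (a, c)])
```

Here `a` collects 5 points, `c` collects 3, and `b` collects 1, so `a` and `c` become the two submitted answers.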

11:18Eric: And it gets two guesses per puzzle. That's the pass@2 scoring — you submit two answers, and you're right if either one is. That's not a loophole, it's baked into the benchmark, because some puzzles are genuinely ambiguous — even the human baseline was measured with two tries.
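For reference, pass@2 scoring as described: a task counts as solved if either of two submitted grids matches the target exactly. The function names and toy grids are illustrative.

```python
from typing import List, Sequence

Grid = List[List[int]]


def solved_pass_at_2(attempts: Sequence[Grid], target: Grid) -> bool:
    """A task is solved if either of (at most) two attempts matches exactly."""
    return any(attempt == target for attempt in attempts[:2])


def accuracy(results: List[bool]) -> float:
    """Fraction of tasks solved."""
    return sum(results) / len(results)


target = [[1, 2], [3, 4]]
results = [
    solved_pass_at_2([[[0, 0], [0, 0]], [[1, 2], [3, 4]]], target),  # second guess hits
    solved_pass_at_2([[[1, 2], [3, 0]], [[0, 2], [3, 4]]], target),  # both guesses miss
]
score = accuracy(results)
```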

11:37Cassidy: Which the judge can exploit deliberately. When the examples truly don't disambiguate between two readings, it can hedge — one guess for each interpretation. Now here's the number Eric teased. Compared to plain majority voting on the very same candidate pool, holistic judging recovered seven more correct answers — and it did that for thirteen percent of the total system cost. Generation eats the other eighty-seven percent. The expensive part throws darts; the cheap part is what finds the bullseye.

12:10Eric: Seven puzzles doesn't sound like much until you hear what all seven were. Every single one was a minority recovery — a case where the correct answer was not the most common candidate, and the judge fished it out anyway. That's not noise. That's the mechanism doing exactly the one job it was designed for.

12:30Cassidy: And I want to make this concrete, because there's one case that's the whole thesis in miniature. On one puzzle, twelve candidates converged on a wrong answer, eight more converged on a different wrong answer, and exactly one candidate got it right. Any vote — any vote at all — buries that lone correct answer under twenty wrong ones. The judges read the arguments and picked the one.
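The twelve, eight, one split can be replayed directly. In this sketch string labels stand in for output grids, and voting is assumed to submit the two most common answers, in line with the two-guess scoring.

```python
from collections import Counter

# Labels stand in for output grids; "correct" marks the lone right answer.
pool = ["wrong_a"] * 12 + ["wrong_b"] * 8 + ["correct"]

# Plurality voting with two submissions: take the two most common answers.
top_two_by_vote = [answer for answer, _ in Counter(pool).most_common(2)]
vote_solves_task = "correct" in top_two_by_vote
```

Both submission slots go to the two wrong clusters, so the single correct candidate never reaches the scorer under any count-based rule.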

12:56Eric: That's the jury overruling the crowd. Now — the synthesis case is even stranger, and I want you to sit on it for a second, because it breaks the frame we've been using.

13:08Cassidy: This is the one that got me. On another task, zero of the twenty-nine candidates produced the correct output. Not one. And the system still got it right.

13:18Eric: How? If nobody produced the answer, what is there to select?

13:22Cassidy: Because the judge is allowed to synthesize — to output a new grid no candidate ever generated. On this task, one candidate had correctly parsed the "rooms" in the grid, another had figured out the arrow semantics, a third had the recoloring right — each held one true piece and got the whole wrong. The judge read all of them and assembled the correct answer none of them reached.

13:47Eric: Three witnesses to a getaway — one caught the plate, one the color, one the direction it fled. No single witness can describe the car. The detective who hears all three assembles a complete picture nobody in the room possessed. Except the judge isn't just transcribing testimony — it's deciding which partial insights to trust, which is a lot more active.

14:12Cassidy: So where we've landed: phase one throws diverse darts to make sure a correct rare answer exists somewhere; phase two reads all the arguments in full and either elevates the lone correct outlier or builds a new answer from broken pieces. And it's the cheap phase carrying the win. That's the architecture. Which brings us to the part of the paper that flatly contradicts a decade of practical advice.

14:40Eric: This is my favorite thread, and it's a working engineer's headline. Everything we've been taught about getting good behavior out of a model says: structure it. Give it a template. Tell it the steps — "first identify the objects, then the transformations." Force the output into clean JSON. Drop in domain hints about symmetry and rotation. That's prompt engineering orthodoxy.

15:07Cassidy: And on these tasks?

15:08Eric: Every one of those moves made it worse. Every attempt to structure the reasoning degraded performance on the hardest puzzles. The final prompt Land ships is deliberately minimal — he strips the scaffolding out. His phrase for what's happening is a "compliance tax on reasoning."

15:28Cassidy: Unpack the tax, because there are two costs in there and they're different.

15:34Eric: Right. Cost one: when you tell a model exactly how to think, it spends its reasoning budget obeying you instead of solving the problem. Picture handing a brilliant problem-solver a rigid worksheet — "now list the objects, now the transformations, now format as JSON." A chunk of their attention goes to filling out your form instead of thinking. Cost two is worse, and it's the one that connects to the whole architecture: prescriptive prompts make all your candidates walk down the same narrow path you specified. They converge. And convergence is the one thing this system cannot afford — you just spent phase one manufacturing diversity, and a bossy prompt collapses it.

16:20Cassidy: So the two threads are actually one thread. The prompting heresy and the anti-voting judge are the same idea pointed at different stages — protect the diversity, at generation and at selection.

16:33Eric: That's the unifying line. And Land puts it sharply: the engineer's job shifts from programming the model's behavior to removing obstacles to the model's reasoning. For genuinely novel discovery — as opposed to structured extraction or classification, where templates genuinely help — over-specifying is counterproductive.

16:55Cassidy: How blind was the test, though? Because "I got seventy-three percent" means nothing if he tuned against the answer key.

17:03Eric: That's the reassuring part. The seventy-three is on the semi-private set — held-out tasks run by the ARC Prize organizers on puzzles Land never saw. Genuinely blind. His own tuning happened on the public set, where he scored about seventy-six. The gap between them is only about three points, which is the evidence he didn't overfit his design — if he'd been gaming the public tasks, that gap would be huge. And as an extra check, he pointed the exact same untouched system at the older ARC-AGI-1 and scored ninety-four and a half percent, cold, with zero exposure during design.

17:42Cassidy: So the system generalizes across two benchmark generations without retuning. That's a real signal the architecture is doing something and not just fitting one test.

17:53Eric: It's a good signal. But this is where I have to put the brakes on — because the architecture is convincing, and the credit assignment inside it is not nearly as solid as the headline numbers make it feel.

18:06Cassidy: Go ahead. This is the part we always do.

18:09Eric: Take that "+7 from holistic judging" — the number we just built a whole segment around. It is not a head-to-head. He didn't run a full judge-based pipeline and a full majority-vote pipeline and compare the endings. He ran the system once, took the pool of candidates it produced, and re-scored that same pool two ways. It's re-watching game footage and calculating how many points a strategy would have scored — not actually playing two games. It can't capture what would've changed if you'd committed to voting from the opening whistle. Maybe voting-as-the-real-mechanism would've produced a different candidate distribution entirely. We don't know.

18:50Cassidy: And the reason he didn't run the clean version is brutally practical — each full run costs around twenty-four hundred dollars.

18:58Eric: Which I completely understand. But it means the modality-complementarity story has the same crack. The evidence that all three families matter is oracle-level — it counts whether a correct candidate exists in a family, not whether removing that family changes the final output. Pull the image channel out entirely and the judge dynamics might shift in ways the overlap counts can't see. So "each modality is essential" is a lower bound on plausibility, not a demonstrated fact. Land flags this himself, to his credit.
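To make the distinction concrete, this is roughly what an oracle-level count measures. The task data is toy data invented for the example, not results from the paper, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# task id -> modality families that produced at least one correct candidate
# (toy data for illustration only)
correct_families: Dict[str, Set[str]] = {
    "task_1": {"text", "code"},
    "task_2": {"image"},
    "task_3": {"code"},
    "task_4": set(),
}


def oracle_coverage(data: Dict[str, Set[str]], family: str) -> int:
    """Number of tasks where the family contains a correct candidate."""
    return sum(1 for families in data.values() if family in families)


def unique_solves(data: Dict[str, Set[str]], family: str) -> List[str]:
    """Tasks where this family is the only one holding a correct candidate."""
    return sorted(task for task, families in data.items() if families == {family})


code_coverage = oracle_coverage(correct_families, "code")
image_only = unique_solves(correct_families, "image")
```

Counts like these say where a correct candidate exists. They say nothing about what the judge would select if a family were removed, which is what a real ablation would have to rerun.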

19:31Cassidy: And the prompting finding — the most quotable claim in the whole paper — is the softest of all, isn't it.

19:38Eric: It's qualitative. "In every case, the more prescriptive the prompt, the worse it did" is a developer's recollection across model versions — not a matched-budget controlled sweep. The paper itself lists the controlled version as an experiment not run. So the pattern is compelling and I believe it directionally, but right now it's closer to a sharp observation than a proven result. Two more, quickly: the headline numbers come from single stochastic runs with no confidence intervals — Land says outright the true accuracy could be meaningfully higher or lower, so we don't actually know how stable that eighteen-point margin is. And the judge is the same model family as the generators, with no shuffling of candidate order — so there's an unaddressed risk it quietly favors reasoning in its own style, or whatever's listed first.
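For the ordering concern, one common control is to give each judge instance its own permutation of the candidates. The sketch below shows that control only; it is not something the paper did, and the function name, labels, and seed are hypothetical.

```python
import random
from typing import List


def shuffled_orders(labels: List[str], n_judges: int, seed: int = 0) -> List[List[str]]:
    """Give each judge instance its own seeded permutation of the candidates."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    orders = []
    for _ in range(n_judges):
        order = labels[:]      # copy so the caller's list is left untouched
        rng.shuffle(order)
        orders.append(order)
    return orders


labels = ["text-01", "text-02", "image-01", "code-01"]
orders = shuffled_orders(labels, n_judges=3)
```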

20:34Cassidy: So the honest read is: the architecture is convincingly good, the component attributions are educated inference. I'll concede that fully. The one thing I'd hold onto is that the two blind numbers — the small public-to-private gap and the cold ninety-four on ARC-AGI-1 — are end-to-end and they're real, whatever the internal credit split turns out to be.

20:59Eric: Agreed, and I'm not walking the system back — I'm saying the pattern is the contribution, not the decimals. And there's one more piece of texture that tells you how raw real-world frontier engineering is. On the public run, of just over fourteen thousand GPT-5.2 API calls, only about twenty-two hundred succeeded. An eighty-four percent failure rate — rate limits, timeouts — which roughly doubled the cost through wasted retries.
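The failure rate follows from the two rounded figures quoted, taking "just over fourteen thousand" as 14,000 and "about twenty-two hundred" as 2,200:

```latex
\text{failure rate} \approx 1 - \frac{2200}{14000} \approx 0.843,
\qquad
\frac{14000}{2200} \approx 6.4 \ \text{calls issued per successful call}
```

The cost doubling does not follow from these counts alone, since it depends on how failed calls were billed, which the episode does not state.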

21:28Cassidy: Eighty-four percent. So a chunk of the achievement here is just surviving the infrastructure.

21:35Eric: And on cost, it's genuinely not efficient, and Land says so plainly. Orders of magnitude more than a single model call, and much of the spend goes to candidates that contribute nothing. Though the punchline is that at about twenty dollars a task it lands near the price of GPT-5.2 Pro while being roughly twenty points more accurate, and it's both cheaper and more accurate than Gemini 3 Pro. Cheap it is not; competitive it is.
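A back-of-envelope check on the cost figures. It assumes the roughly $2,400 run cost and the roughly $20 per task describe the same configuration, and that the 13% judging share applies evenly per task; neither assumption is confirmed in the episode.

```latex
\frac{\$2400 \ \text{per run}}{\$20 \ \text{per task}} \approx 120 \ \text{tasks},
\qquad
0.13 \times \$20 \approx \$2.60 \ \text{judging},
\qquad
0.87 \times \$20 \approx \$17.40 \ \text{generation}
```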

22:05Cassidy: There's also a failure that cuts the other way — where the system's own cleverness bites it. One simple task where the test case secretly inverts a color legend. Every model made the same easy assumption — legend stays the same — and the early-stopping probe fired on that false consensus and quit. When every model makes the identical simplifying assumption, no amount of extra candidates saves you, because they'd all make it too.

22:35Eric: That's the shadow of early stopping we flagged. Confident agreement is the exact signal it trusts — and confident agreement is also what a shared wrong assumption looks like from the outside.

22:49Cassidy: So let me pull the real takeaway up, above the method. The durable result here isn't "generate twenty-nine candidates and use three judges." It's a reframing of what test-time compute should buy you. The reflex has been: spend your inference budget on more samples of the same kind, and vote. This paper shows voting actively fails on the tasks that matter most — because on a hard problem the majority is confidently, uniformly wrong. The alternative is to spend the budget on diversity of representation, then invest a small slice in a judge that reads the actual arguments instead of counting them.

23:31Eric: And the reason that's more than a leaderboard trick — Land argues the pattern should transfer anywhere models produce confident but divergent answers. Proof search, legal analysis, medical diagnosis. Anywhere the correct answer might be the dissent that one careful voice arrived at, and the crowd talked itself out of.

23:54Cassidy: Plus the quieter story underneath it. One person, no training run, an API budget and an architecture, beat the flagship configurations of the labs whose models he was calling. That's a real data point on the side of "clever orchestration matters" — that the frontier isn't only a resource game.

24:14Eric: With the honest asterisk Land puts on it himself: this is a snapshot. Frontier models improve fast, and the gap between this ensemble and its best single component will probably narrow. He's betting on the pattern outliving the numbers.

24:30Cassidy: So here's what I'd put to you. If the correct answer on the hardest problems really is a minority report, then the whole industry's instinct — sample more, trust the consensus — is aimed the wrong way for exactly the problems we most want to crack. So which is it: is the future of hard reasoning about generating a smarter single model that doesn't need a jury — or about getting much better at reading a room full of confident, disagreeing ones and picking the lone voice that's right? Drop where you land, and what domain you'd trust a judge like this on first.

25:09Eric: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related work grouped by theme, from self-consistency to the judge-model literature, plus our weekly and monthly roundups.

25:26Cassidy: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Eric and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2," by Johan Land, published June 30th, 2026 — we're recording the day after, on July 1st.

25:50Eric: The models threw twenty-nine darts and mostly missed. The trick wasn't a better throw — it was knowing which dart to walk up and pull out of the board. See you in the next one.