Episode 003 · May 01, 2026 · 17 min

How to Pick the Best of Sixteen Coding Agent Rollouts

Kim, Yang, Niu et al.

Test-time Scaling Agentic AI Systems

paperdive.ai

Concepts in this episode

Agentic AI · Evaluation & Benchmarks · Training Methods · Test-Time Compute · Agentic Coding · SWE-bench · Rollout Summarization · Tournament Voting · LLM-as-Judge · Parallel Sampling · Iterative Refinement · Context Quality · Long-Horizon Tasks

About this episode

Paper

Scaling Test-Time Compute for Agentic Coding

Venue

arXiv:2604.16529

Year

2026

Read the paper

arxiv.org/abs/2604.16529

Also available on Apple Podcasts and Spotify

When an AI coding agent takes forty steps and tens of thousands of tokens to fix a single bug, running sixteen attempts in parallel is easy — picking the winner is the hard part. A new paper from Meta Superintelligence Labs argues the real bottleneck in agentic test-time scaling isn't compute, it's representation: you can't select what you can't compare, and you can't reuse what you can't summarize.

What you'll take away

Why classic test-time scaling tricks like majority voting break down when the unit of work is a 40,000-token interactive session
How Recursive Tournament Voting uses pairwise bracket-style judging on compressed rollout summaries to pick a winner — and why pairwise beats flat ranking
The near-deterministic finding that the quality of priors passed to a second wave of attempts essentially determines whether those attempts succeed
Concrete gains: 6–16 percentage points on SWE-Bench Verified and Terminal-Bench v2 across Claude and Gemini, plus a 2x to 3x drop in steps-per-attempt after refinement
Where the pipeline gets worse: refinement is a redistribution, not a strict improvement — more tasks become uniformly solvable, but more also become uniformly unsolvable
Why the judge being the same model as the generator is the load-bearing weakness, and why a dedicated trained judge is the obvious next step

Chapters

00:00 Why voting fails for agentic rollouts
02:08 Summarization as the load-bearing move
04:16 Recursive Tournament Voting explained
06:24 Parallel-Distill-Refine and the relay race
08:33 The headline numbers and step efficiency
10:41 The context-quality finding that justifies the architecture
12:49 Steelman: where the pipeline is fragile
14:58 Representation, not compute, as the new frontier

References in this episode

Self-Consistency Improves Chain of Thought Reasoning in Language Models — The canonical majority-voting test-time scaling paper whose 'vote on the answer'
Self-Refine: Iterative Refinement with Self-Feedback — The classic single-trajectory refinement method that R-T-V and P-D-R generalize
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — The benchmark behind the episode's headline numbers, useful for understanding wh
Large Language Models are not Fair Evaluators — Direct evidence on the judge-reliability concern Finn raises — LLM judges have s

Full transcript

0:00Hope: Picture this. You give an AI coding agent a real bug from a real GitHub repo and tell it, fix this. It opens the codebase, starts grepping around, makes a guess, runs the tests, hits an error, tries something else, installs a missing dependency, runs the tests again, finds its fix didn't actually work, takes a different approach. Forty steps later, it either submitted a patch or it didn't. Now run that same task sixteen times in parallel. You've just spent a lot of compute. Sixteen separate attempts, each one a sprawling forty-step odyssey, each one tens of thousands of tokens of interleaved actions, terminal output, and partial reasoning. Here's the question the paper we're talking about today is built around: how do you pick the best one?

0:50Finn: And the obvious answer is to vote on it, the way majority voting is done for math problems, but that completely falls apart. There's nothing to vote on. There's no clean answer at the end. There's a forty-thousand-token transcript of an interactive session.

1:07Hope: Right. And you can't fit sixteen of those into a single context window even if you wanted to. So this is a paper from Meta Superintelligence Labs and a bunch of academic collaborators — it's called "Scaling Test-Time Compute for Agentic Coding," posted to arXiv in mid-April twenty-twenty-six, and we're recording on May first. Quick note before we dig in: this episode is AI-generated. The script comes from Anthropic's Claude Opus 4.7. I'm Hope and my co-host is Finn — we're both AI voices from Eleven Labs, and the show isn't affiliated with either company. With that said — the reason this paper is worth a full episode is that it doesn't just patch the voting problem. It reframes what test-time scaling actually is when the unit of work is no longer an answer but an entire interactive session.

2:00Finn: And the reframing is the part I want to keep coming back to. For the last couple of years, "test-time scaling" has been an incredibly reliable lever. You've got a math problem, you sample ten answers, you take the majority. You've got a short coding puzzle, you generate a draft and ask the model to critique it. These recipes work. They've been part of the standard playbook. The thing this paper is saying — and I think they're right — is that all of those recipes share an assumption. The assumption is that the model's output is small enough and clean enough that you can compare outputs directly, or feed one back as the next input. Once you're in agentic territory, that assumption is just gone.

2:44Hope: The image they use, implicitly, is a sports tournament. Sixteen teams, single-elimination bracket, four rounds, one champion. Except the teams are coding attempts, and there's no scoreboard. There's just a judge — the same model that made the attempts in the first place — reading two attempts side by side and picking who did better. That's their parallel scaling method. They call it Recursive Tournament Voting, RTV for short.

3:10Finn: And the kicker, before we even get to how it works, is the prerequisite. You can't have a judge read two forty-thousand-token rollouts and reliably pick the better one. So before any tournament happens, every rollout — winning attempts and losing attempts both — gets compressed into a structured summary. What did this attempt try? What did it observe? What hypotheses did it form? What worked? What failed? Once you've got that summary, comparison becomes tractable.
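A structured summary of the kind Finn describes could be modeled as a small record. The sketch below is an illustration under assumptions, not the paper's schema: the five fields simply mirror the five questions above, and `RolloutSummary` and `render` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RolloutSummary:
    """Hypothetical compressed record of one agent rollout."""
    tried: List[str] = field(default_factory=list)       # what the attempt tried
    observed: List[str] = field(default_factory=list)    # what it observed
    hypotheses: List[str] = field(default_factory=list)  # hypotheses it formed
    worked: List[str] = field(default_factory=list)      # what worked
    failed: List[str] = field(default_factory=list)      # what failed

    def render(self) -> str:
        """Flatten the record into short text a judge or a later agent can read."""
        sections = [
            ("Tried", self.tried),
            ("Observed", self.observed),
            ("Hypotheses", self.hypotheses),
            ("Worked", self.worked),
            ("Failed", self.failed),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Empty sections are kept visible so two summaries line up when compared.
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        return "\n".join(lines)


# Made-up example content, for illustration only.
example = RolloutSummary(
    tried=["patched the date parser"],
    observed=["two tests still fail"],
    failed=["first patch ignored time zones"],
)
text = example.render()
```

Keeping every section present, even when empty, is a design choice of this sketch: it gives a pairwise judge two documents with the same shape.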

3:40Hope: That's the load-bearing move. The summarization step. Once I saw it, the rest of the paper kind of clicked into place. There's an analogy here that I think nails it: think about how science actually works. A chemist doesn't pass forward instrument printouts and timestamps. They pass forward the write-up. Hypothesis, procedure, what worked, what didn't, what to try next. The raw data is unusable as input for the next experiment. The notebook entry is what propagates. These structured summaries are doing exactly that for coding rollouts.

4:19Finn: OK. Walk me through the tournament.

4:23Hope: Sixteen attempts in parallel. You pair them up: eight matches in round one. Each match, the judge reads the two summaries and picks a winner. They actually run that judgment eight separate times per match, with majority rule, because you want the comparison to be reliable. Eight winners advance. Round two, four matches, four winners. Round three, two matches. Round four, one match, one champion. That's RTV.
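The bracket Hope walks through can be sketched in a few lines. This is a minimal illustration, not the paper's code: `play_match`, `tournament`, and `toy_judge` are hypothetical names, the judge here is a stand-in for an LLM call, and the tie-breaking and bye rules are assumptions the episode does not specify.

```python
from typing import Callable, Sequence, Tuple

# A judge reads two summaries and returns 0 if the first is better, 1 if the second is.
Judge = Callable[[str, str], int]


def play_match(a: str, b: str, judge: Judge, votes: int = 8) -> str:
    """Ask the judge `votes` times and return the majority winner."""
    votes_for_b = sum(judge(a, b) for _ in range(votes))
    # Assumption: with an even vote count a tie is possible; keep the first entrant.
    return b if 2 * votes_for_b > votes else a


def tournament(summaries: Sequence[str], judge: Judge, votes: int = 8) -> Tuple[str, int]:
    """Single-elimination bracket. Returns (champion, number of rounds played)."""
    if not summaries:
        raise ValueError("need at least one summary")
    entrants = list(summaries)
    rounds = 0
    while len(entrants) > 1:
        winners = [
            play_match(entrants[i], entrants[i + 1], judge, votes)
            for i in range(0, len(entrants) - 1, 2)
        ]
        if len(entrants) % 2 == 1:
            winners.append(entrants[-1])  # assumption: an unpaired entrant gets a bye
        entrants = winners
        rounds += 1
    return entrants[0], rounds


# Toy stand-in for the LLM judge: each summary has a hidden quality score.
quality = {f"summary-{i:02d}": (i * 7) % 16 for i in range(16)}


def toy_judge(a: str, b: str) -> int:
    return 1 if quality[b] > quality[a] else 0


champion, rounds = tournament(list(quality), toy_judge)
```

With a perfectly consistent toy judge the best entrant always wins in four rounds. A real judge is noisy, which is why each match is repeated and decided by majority.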

4:53Finn: One of the cleaner empirical results in the paper is that the bracket structure itself matters. They tried the obvious alternatives — just give the judge all sixteen summaries at once and ask which is best. Or split them into groups of eight and rank within each. Or groups of four. The pairwise version, the bracket, beats all of them.

5:17Hope: Which is intuitive once you say it out loud, but it's the kind of thing where you want to see the data. The intuition is that comparing two things is a much easier judgment task than ranking sixteen. A flat ranking forces the judge to hold everything in its head and produce a global ordering. The bracket replaces one hard global decision with many easy local ones. Same as why human judges in basically any domain prefer A-versus-B over rank-these-twenty.
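The trade between one hard global decision and many easy local ones can be made concrete with a little arithmetic. The sketch below only counts judge calls for the sixteen-entrant, eight-votes-per-match setup described in the episode; `bracket_schedule` is a hypothetical helper.

```python
from typing import List


def bracket_schedule(entrants: int) -> List[int]:
    """Matches per round for a single-elimination bracket (entrants must be a power of two)."""
    if entrants < 1 or entrants & (entrants - 1):
        raise ValueError("entrants must be a power of two")
    schedule = []
    while entrants > 1:
        entrants //= 2
        schedule.append(entrants)
    return schedule


matches = bracket_schedule(16)   # [8, 4, 2, 1]
total_matches = sum(matches)     # 15
judge_calls = total_matches * 8  # 120 calls at eight votes per match
summaries_per_call = 2           # a flat ranking would put all 16 in one call
```

The bracket spends many more judge calls than a single flat ranking, but every call compares only two summaries.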

5:51Finn: The second mechanism is where I think the real magic is. The tournament doesn't just stop at one champion. They take the top four summaries from the first wave, and then they launch a fresh wave of sixteen new attempts. Brand new agent, brand new environment. Before this new agent takes a single action, it reads those four summaries. So the new attempts inherit context: here's what four previous attempts thought about this bug, what they tried, where they got stuck, what they ruled out.

6:22Hope: It's the relay race image. First wave of runners takes the course. Before the second wave starts, the four most informative runners write down what they learned — where the course gets tricky, what shortcuts didn't work, where they wasted time. The second wave reads those notes before the gun goes off. They call this Parallel-Distill-Refine, PDR. Then those second-wave attempts go through their own tournament, and you get one final winner.

6:50Finn: So the full pipeline, end to end: sixteen attempts, tournament down to four, sixteen new attempts that have read those four summaries, tournament down to one. Thirty-two rollouts per problem, plus all the judging.
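Finn's end-to-end description maps onto a short control loop. The sketch below is an assumption-laden outline rather than the authors' implementation: `run_agent`, `summarize`, and `judge` are hypothetical callables supplied by the caller, the waves run sequentially here although the episode describes them as parallel, and the top four are taken to be whatever survives the bracket once four entrants remain.

```python
from typing import Callable, List, Sequence

Judge = Callable[[str, str], int]  # 0 if the first summary is better, 1 otherwise


def reduce_bracket(summaries: Sequence[str], judge: Judge, keep: int, votes: int = 8) -> List[str]:
    """Play pairwise rounds until at most `keep` summaries remain."""
    entrants = list(summaries)
    while len(entrants) > keep:
        winners = []
        for i in range(0, len(entrants) - 1, 2):
            a, b = entrants[i], entrants[i + 1]
            votes_for_b = sum(judge(a, b) for _ in range(votes))
            winners.append(b if 2 * votes_for_b > votes else a)  # ties keep `a` (assumption)
        if len(entrants) % 2 == 1:
            winners.append(entrants[-1])  # unpaired entrant gets a bye (assumption)
        entrants = winners
    return entrants


def parallel_distill_refine(
    task: str,
    run_agent: Callable[[str, List[str]], str],
    summarize: Callable[[str], str],
    judge: Judge,
    width: int = 16,
    top_k: int = 4,
) -> str:
    # Wave 1: independent attempts with no prior context.
    wave1 = [summarize(run_agent(task, [])) for _ in range(width)]
    priors = reduce_bracket(wave1, judge, keep=top_k)
    # Wave 2: fresh attempts that read the selected summaries before acting.
    wave2 = [summarize(run_agent(task, priors)) for _ in range(width)]
    return reduce_bracket(wave2, judge, keep=1)[0]


# Stubs so the sketch runs without any model calls.
calls: List[int] = []


def fake_agent(task: str, priors: List[str]) -> str:
    calls.append(len(priors))
    return f"{task}: rollout {len(calls)} after reading {len(priors)} summaries"


final = parallel_distill_refine(
    "demo-task", fake_agent, summarize=lambda r: r, judge=lambda a, b: 0
)
```

The stubs confirm the accounting from the episode: thirty-two rollouts per problem, the first sixteen with no priors and the second sixteen with four each.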

7:03Hope: And the headline numbers are real. On SWE-Bench Verified — which is the standard benchmark of actual GitHub bugs from open-source Python projects — Claude four-point-five Opus jumps from about seventy-one percent to about seventy-eight. On Terminal-Bench v-two, the harder command-line benchmark with stuff like "recover this corrupted SQLite file," the same model goes from forty-seven to fifty-nine. Gemini three-point-one Pro on Terminal-Bench: fifty-three to sixty-five. Claude four-point-five Sonnet on Terminal-Bench: forty-one to fifty-seven. These are six to sixteen percentage points of accuracy, on top of frontier models, with no training, just clever inference.

7:45Finn: Which is the kind of gap that usually comes from a model generation upgrade. It's a lot. But it's also expensive, and we should be honest about that. Thirty-two rollouts per problem, plus tournament voting, plus summarization calls. This is multi-x the compute of a single attempt. They're trading a lot of inference for those points.

8:05Hope: Yeah. The cost is real and they don't bury it. But there's a finding inside the paper that I think is the single most interesting thing in the whole experimental section — and it's the one that justifies the architecture, in my read. They take all the second-wave attempts, the ones that read four prior summaries before starting. They bucket those tasks by how many of the four priors actually solved the problem. Like — were the priors any good?

8:32Finn: And the result is just — almost too clean. When zero out of four priors got it right, the new attempts succeed essentially zero percent of the time. Maybe one or two percent. When four out of four priors got it right, the new attempts succeed almost a hundred percent of the time. Ninety-seven, ninety-eight, ninety-nine. The in-between buckets walk smoothly between those two extremes.

8:55Hope: The quality of the prior context essentially determines whether the next attempt succeeds. It's a near-deterministic relationship.
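The bucketing behind this finding is simple to express. The sketch below shows the analysis shape only: `success_rate_by_prior_quality` is a hypothetical helper, and the records are made up for illustration and are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_prior_quality(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Each record is (how many of the four priors solved the task,
    whether this second-wave attempt passed)."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for good_priors, ok in records:
        total[good_priors] += 1
        passed[good_priors] += int(ok)
    return {bucket: passed[bucket] / total[bucket] for bucket in sorted(total)}


# Made-up records, for illustration only.
toy_records = (
    [(0, False)] * 9 + [(0, True)]
    + [(2, True)] * 5 + [(2, False)] * 5
    + [(4, True)] * 10
)
rates = success_rate_by_prior_quality(toy_records)
```

In the paper's version of this analysis, as described in the episode, the zero-of-four bucket sits near zero percent and the four-of-four bucket near one hundred.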

9:04Finn: Once you really sit with that, the entire architecture stops being clever and starts being kind of forced. Because what they've shown is that whatever you give the next wave of attempts to read — that's the ceiling. If you give them garbage, you get garbage. Give them a good seed, you get success. So now the whole question of test-time scaling reduces to: how do you make sure the priors you pass forward are the good ones? That's exactly what the tournament is for.

9:32Hope: The tournament isn't decoration. It's the filtration step that determines whether refinement amplifies signal or amplifies noise. Without RTV picking the priors, PDR is a coin flip on whether the second wave learns from successes or doubles down on the wrong diagnosis.

9:50Finn: There's one more empirical thing I want to flag, because I think it's underplayed in the paper. The step efficiency. The refined attempts — the ones in the second wave that read prior summaries — don't just succeed more often. They succeed faster. Way faster.

10:06Hope: How much faster?

10:07Finn: For Claude four-point-five Opus on SWE-Bench, the average attempt drops from forty-one steps to fourteen. Gemini three-point-one Pro drops from thirty-six to eighteen. The agent isn't re-grepping the directory structure. It isn't rediscovering which file the bug lives in. It isn't doing the dependency-installation dance for the third time. The summary tells it where to go, and it goes there.

10:33Hope: There's a great concrete example in the appendix. A Django bug. The first wave of attempts kept hitting ModuleNotFoundError mid-run because of missing packages. The refined attempt opens by saying, in effect: based on analysis of four prior attempts, I'll preemptively install asgiref, pytz, and sqlparse. Then it goes directly to the buggy line and applies the fix in ten steps instead of forty.

11:00Finn: Which is the right way to think about what the summaries are actually doing. They're not evaluation criteria. They're inherited operational knowledge. The agent in the second wave is starting from a much more advanced point in the problem-solving process.

11:17Hope: And it makes the cost picture a little less ugly. Yes, you're running thirty-two rollouts. But the second sixteen are each shorter than a baseline rollout would have been. So the multi-x is real but not as bad as you'd think from "thirty-two attempts."
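A rough step count makes the point concrete. The sketch uses the Claude Opus SWE-Bench averages quoted in the episode (forty-one steps before refinement, fourteen after). It assumes that average steps are a fair proxy for cost and ignores summarization and judging calls, so it is an illustration and not a cost model.

```python
def pipeline_steps(width: int, baseline_steps: int, refined_steps: int) -> int:
    """Total agent steps for one unrefined wave plus one refined wave."""
    return width * baseline_steps + width * refined_steps


BASELINE = 41  # average steps per first-wave attempt (from the episode)
REFINED = 14   # average steps per second-wave attempt (from the episode)

opus_total = pipeline_steps(16, BASELINE, REFINED)  # 16*41 + 16*14 = 880
naive_total = 32 * BASELINE                         # 1312 if all 32 ran at baseline length
opus_multiple = opus_total / BASELINE               # between 21 and 22 single attempts
naive_multiple = naive_total / BASELINE             # 32 single attempts
speedup = BASELINE / REFINED                        # just under 3x fewer steps per attempt
```

Measured in agent steps alone, the thirty-two rollouts cost closer to twenty-one or twenty-two baseline attempts than thirty-two.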

11:33Finn: OK. I want to push back here. I do think the paper is good, but I want to steelman the skeptic's case. The biggest objection is the judge.

11:43Hope: Yeah, Finn, this is the place I'd push too.

11:45Finn: The judge in RTV is the same frontier model that produced the rollouts. They report judge accuracy in one of the tables — how often does the judge actually pick the better attempt when we know which one was better? — and at the final rounds it's running between sixty and eighty percent. That's a lot of error, and it's correlated error, because the judge has the same blind spots as the generator. If the model is systematically wrong about a class of bug, it'll be just as wrong evaluating attempts on that bug as it was producing them.
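A quick calculation shows why correlated error matters so much. The sketch below asks how much eight-vote majority rule would help if each vote were independent and correct with probability p. That independence is an assumption, and so is reading the sixty-to-eighty percent figures as per-vote accuracy, which the episode does not specify. `majority_accuracy` is a hypothetical helper.

```python
from math import comb


def majority_accuracy(p: float, votes: int = 8) -> float:
    """Probability that a majority of independent votes is correct.
    Assumption: an exact tie is resolved by a fair coin flip."""
    def prob(k: int) -> float:
        return comb(votes, k) * p**k * (1 - p) ** (votes - k)

    wins = sum(prob(k) for k in range(votes // 2 + 1, votes + 1))
    tie = prob(votes // 2) if votes % 2 == 0 else 0.0
    return wins + 0.5 * tie


at_60 = majority_accuracy(0.60)  # better than a single 60% vote
at_80 = majority_accuracy(0.80)  # better still
coin = majority_accuracy(0.50)   # repetition cannot rescue a coin-flip judge
```

Repetition only helps when errors are independent. If the judge is systematically wrong about a class of bug, all eight votes tend to be wrong together, and the match accuracy stays close to the single-vote figure.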

12:19Hope: There's a footnote about Gemini three-point-one Pro specifically having unusually low judge accuracy and correspondingly weaker final improvements, which is a hint that the whole pipeline is bottlenecked on judgment quality. A separately-trained dedicated judge — the kind of thing you'd build with supervised fine-tuning or reinforcement learning on actual rollout-quality labels — is the obvious next step. They explicitly say that. But it isn't tested.

12:47Finn: The second thing I'd flag, Hope, is the bimodal collapse. The paper celebrates that more tasks reach sixteen-out-of-sixteen passing after refinement, and that's true — for Claude on SWE-Bench, the number of tasks where every single attempt succeeds jumps from two-oh-nine to three-fifty. But the number of tasks where every single attempt fails also goes up. From seventy-three to ninety-four.

13:12Hope: Right. Refinement isn't a strict improvement. It's a redistribution. For tasks where the initial attempts were mediocre, the tournament can pick the wrong four to pass forward, and now the second wave is anchored on bad priors. So they flail harder.
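The redistribution is easier to see once the middle bucket is included. The sketch below uses the all-pass and all-fail counts quoted in the episode. It assumes the evaluation covered the full 500-task SWE-Bench Verified set, which the episode does not state, so the mixed counts are illustrative.

```python
from typing import Dict

TOTAL_TASKS = 500  # assumption: the full SWE-Bench Verified set was evaluated


def split(all_pass: int, all_fail: int, total: int = TOTAL_TASKS) -> Dict[str, int]:
    """Partition tasks into all-16-pass, all-16-fail, and mixed outcomes."""
    return {
        "all_pass": all_pass,
        "mixed": total - all_pass - all_fail,
        "all_fail": all_fail,
    }


before = split(all_pass=209, all_fail=73)  # mixed = 218
after = split(all_pass=350, all_fail=94)   # mixed = 56
```

Under that assumption, refinement drains the mixed bucket from 218 tasks to 56. Most of those tasks move to all-pass, but some move to all-fail.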

13:27Finn: Which connects directly to the context-quality finding. If the priors are garbage, the next wave is garbage, almost deterministically. The aggregate average rises, the variance gets worse, and for any specific hard task, you might be worse off than just running sixteen attempts and crossing your fingers.

13:47Hope: There's also the framing of "long-horizon agentic coding." That phrase is doing a lot of work. They tested two benchmarks — SWE-Bench and Terminal-Bench. Both are English-language coding tasks, both have discrete pass-fail outcomes from automated test suites, both fit on a single machine. Whether this same recipe transfers to multi-day software engineering work, web agents, scientific research agents, things without crisp test outcomes — that's genuinely open.

14:19Finn: The summaries themselves are also a free variable. The whole pipeline rests on the assumption that the summary preserves the right information. But the summary is generated by the same model under a fixed prompt. They don't ablate summarization quality. If the model is bad at summarizing on some task family — and it's plausible it would be, since summarization is itself a skill that varies by domain — the substrate of the whole apparatus quietly degrades.

14:50Hope: All those critiques are real. None of them undo the central reframe, though, which is the part I keep coming back to. For two years, the field's mental model of test-time scaling has been "more compute equals more capability." Sample more, think longer, make the context window bigger. This paper is making a different claim. For long-horizon agents, raw compute hits a wall. Sixteen rollouts of forty thousand tokens each is a lot of compute. But if you can't aggregate them — if you can't compare them, can't summarize them, can't pass useful pieces of them forward — that compute is wasted. The bottleneck isn't generation. It's representation.

15:34Finn: And that's a different research agenda. If the bottleneck is representation, then the next frontier isn't longer contexts or smarter base models. It's better experience artifacts. What does an agent's lab notebook look like? Should it be textual summaries? Persistent code files? Derived test cases? Reusable tools the agent built for itself in an earlier attempt and now keeps around? The paper gestures at this in its conclusion. Today, prior experience flows through summaries. Tomorrow, it could flow through accumulating workspace state.

16:08Hope: Which makes this paper feel less like a technique and more like a marker. It's the place the field stops asking "how do we make a single forward pass smarter" and starts asking "how do we make a sequence of attempts collectively smarter than any one of them." That's much closer to how human teams work than how individual humans think.

16:30Finn: And the deepest line in the paper, the one I'd actually quote — it's something like: you can't select what you can't compare, and you can't reuse what you can't summarize. That's the whole thesis in one sentence.

16:43Hope: It is. And once you've heard it, it's hard to unhear.

16:46Finn: The paper is from Meta Superintelligence Labs and collaborators across UW, NYU, Carnegie Mellon, Princeton, and Google DeepMind, posted in April. We'll link it in the show notes alongside the related background reading on Self-Refine, Self-Consistency, and the original Parallel-Distill-Refine work.

17:05Hope: Thanks for listening to AI Papers: A Deep Dive.
