Episode 003 · May 01, 2026 · 17 min

How to Pick the Best of Sixteen Coding Agent Rollouts

Kim, Yang, Niu et al.

Test-time Scaling Agentic AI Systems
Paper: Scaling Test-Time Compute for Agentic Coding
Venue: arXiv:2604.16529
Year: 2026
Read the paper: arxiv.org/abs/2604.16529

When an AI coding agent takes forty steps and tens of thousands of tokens to fix a single bug, running sixteen attempts in parallel is easy; picking the winner is the hard part. A new paper from Meta Superintelligence Labs argues the real bottleneck in agentic test-time scaling isn't compute, it's representation: you can't select what you can't compare, and you can't reuse what you can't summarize.

What you'll take away

  • Why classic tricks like majority voting break down when the unit of work is a 40,000-token interactive session
  • How Recursive Tournament Voting uses pairwise bracket-style judging on compressed summaries to pick a winner, and why pairwise beats flat ranking
  • The near-deterministic finding that the quality of priors passed to a second wave of attempts essentially determines whether those attempts succeed
  • Concrete gains: 6–16 percentage points on SWE-Bench Verified and Terminal-Bench v2 across multiple models, plus a drop of up to 3x in steps per attempt after refinement
  • Where the pipeline gets worse: refinement is a redistribution, not a strict improvement — more tasks become uniformly solvable, but more also become uniformly unsolvable
  • Why the judge being the same model as the generator is the load-bearing weakness, and why a dedicated trained judge is the obvious next step

Chapters

  1. 00:00 Why voting fails for agentic rollouts
  2. 02:08 Summarization as the load-bearing move
  3. 04:16 Recursive Tournament Voting explained
  4. 06:24 Parallel-Distill-Refine and the relay race
  5. 08:33 The headline numbers and step efficiency
  6. 10:41 The context-quality finding that justifies the architecture
  7. 12:49 Steelman: where the pipeline is fragile
  8. 14:58 Representation, not compute, as the new frontier

0:00Hope: Picture this. You give an AI coding agent a real bug from a real repo and tell it, fix this. It opens the codebase, starts grepping around, makes a guess, runs the tests, hits an error, tries something else, installs a missing dependency, runs the tests again, finds its fix didn't actually work, takes a different approach. Forty steps later, it either submitted a patch or it didn't. Now run that same task sixteen times in parallel. You've just spent a lot of compute. Sixteen separate attempts, each one a sprawling forty-step odyssey, each one tens of thousands of tokens of interleaved actions, terminal output, and partial reasoning. Here's the question the paper we're talking about today is built around: how do you pick the best one?

0:50Finn: And the obvious answer is to vote on it, the way majority voting is done for math problems, but this completely falls apart. There's nothing to vote on. There's no clean answer at the end. There's a sixteen-thousand-token transcript of an interactive session.

1:07Hope: Right. And you can't fit sixteen of those into a single context window even if you wanted to. So this is a paper from Meta Superintelligence Labs and a bunch of academic collaborators. It's called "Scaling Test-Time Compute for Agentic Coding," posted to arXiv in mid-April twenty-twenty-six, and we're recording on May first. Quick note before we dig in: this episode is AI-generated. The script comes from Anthropic's Claude. I'm Hope and my co-host is Finn. We're both AI voices from ElevenLabs, and the show isn't affiliated with either company. With that said, the reason this paper is worth a full episode is that it doesn't just patch the voting problem. It reframes what test-time scaling actually is when the unit of work is no longer an answer but an entire interactive session.

2:00Finn: And the reframing is the part I want to keep coming back to. For the last couple of years, "test-time scaling" has been an incredibly reliable lever. You've got a math problem, you sample ten answers, you take the majority. You've got a short coding puzzle, you generate a draft and ask the model to critique it. These recipes work. They've been part of the standard playbook. The thing this paper is saying, and I think they're right, is that all of those recipes share an assumption. The assumption is that the model's output is small enough and clean enough that you can compare outputs directly, or feed one back as the next input. Once you're in agentic territory, that assumption is just gone.

2:44Hope: The image they use, implicitly, is a sports tournament. Sixteen teams, single-elimination bracket, four rounds, one champion. Except the teams are coding attempts, and there's no scoreboard. There's just a judge, the same model that made the attempts in the first place, reading two attempts side by side and picking who did better. That's their parallel scaling method. They call it Recursive Tournament Voting, RTV for short.

3:10Finn: And the kicker, before we even get to how it works, is the prerequisite. You can't have a judge read two forty-thousand-token transcripts and reliably pick the better one. So before any tournament happens, every rollout, winning attempts and losing attempts both, gets compressed into a structured summary. What did this attempt try? What did it observe? What hypotheses did it form? What worked? What failed? Once you've got that summary, comparison becomes tractable.
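
One way to picture such a structured summary is as a small record with one field per question Finn lists. This is only an illustrative sketch: the class name, field names, and rendering format are assumptions for the example, not the paper's actual schema or prompt.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutSummary:
    """Hypothetical structured summary of one rollout (names are illustrative)."""
    attempted: list[str] = field(default_factory=list)     # what the attempt tried
    observations: list[str] = field(default_factory=list)  # what it observed
    hypotheses: list[str] = field(default_factory=list)    # hypotheses it formed
    worked: list[str] = field(default_factory=list)        # what worked
    failed: list[str] = field(default_factory=list)        # what failed

    def render(self) -> str:
        """Flatten the summary into a short text block a judge could read."""
        sections = [
            ("Tried", self.attempted),
            ("Observed", self.observations),
            ("Hypotheses", self.hypotheses),
            ("Worked", self.worked),
            ("Failed", self.failed),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Empty sections are kept so every summary has the same shape.
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        return "\n".join(lines)
```

Keeping every section present, even when empty, means two summaries always line up field by field, which is the property that makes side-by-side comparison cheap.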

3:40Hope: That's the load-bearing move. The summarization step. Once I saw it, the rest of the paper kind of clicked into place. There's an analogy here that I think nails it: think about how science actually works. A chemist doesn't pass forward instrument printouts and timestamps. They pass forward the write-up. Hypothesis, procedure, what worked, what didn't, what to try next. The raw data is unusable as input for the next experiment. The notebook entry is what propagates. These structured summaries are doing exactly that for coding agents.

4:19Finn: OK. Walk me through the tournament.

4:23Hope: Sixteen attempts in parallel. You pair them up: eight matches in round one. Each match, the judge reads the two summaries and picks a winner. They actually run that judgment eight separate times per match, with majority rule, because you want the comparison to be reliable. Eight winners advance. Round two, four matches, four winners. Round three, two matches. Round four, one match, one champion. That's Recursive Tournament Voting.
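
The bracket Hope describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `judge` callable stands in for a model call, the function names are made up for the example, and sending ties to the first entrant is an assumed rule, since the episode doesn't say how an even split of eight votes is resolved.

```python
from typing import Callable, Sequence

# A judge returns 0 if the first summary is better, 1 if the second is.
Judge = Callable[[str, str], int]


def play_match(a: str, b: str, judge: Judge, votes: int = 8) -> str:
    """Ask the judge `votes` times and return the majority winner.

    Ties go to the first entrant (an assumption made for this sketch).
    """
    wins_b = sum(judge(a, b) for _ in range(votes))
    return b if wins_b > votes / 2 else a


def tournament(summaries: Sequence[str], judge: Judge, votes: int = 8) -> str:
    """Single-elimination bracket over a power-of-two number of summaries."""
    bracket = list(summaries)
    if not bracket or len(bracket) & (len(bracket) - 1):
        raise ValueError("need a power-of-two number of summaries")
    while len(bracket) > 1:
        # Pair neighbours; each round halves the bracket.
        bracket = [play_match(bracket[i], bracket[i + 1], judge, votes)
                   for i in range(0, len(bracket), 2)]
    return bracket[0]


# Toy judge (an assumption for the demo): prefer the longer summary.
toy_judge = lambda a, b: int(len(b) > len(a))
entries = [f"summary {i}" + "!" * i for i in range(16)]
print(tournament(entries, toy_judge) == entries[15])  # True
```

With sixteen entrants the bracket plays 8 + 4 + 2 + 1 = 15 matches, so at eight votes per match a full tournament costs 120 judge calls.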

4:53Finn: One of the cleaner empirical results in the paper is that the bracket structure itself matters. They tried the obvious alternatives — just give the judge all sixteen summaries at once and ask which is best. Or split them into groups of eight and rank within each. Or groups of four. The pairwise version, the bracket, beats all of them.

5:17Hope: Which is intuitive once you say it out loud, but it's the kind of thing where you want to see the data. The intuition is that comparing two things is a much easier judgment task than ranking sixteen. A flat ranking forces the judge to hold everything in its head and produce a global ordering. The bracket replaces one hard global decision with many easy local ones. Same as why human judges in basically any domain prefer A-versus-B over rank-these-twenty.

5:51Finn: The second mechanism is where I think the real magic is. The tournament doesn't just stop at one champion. They take the top four summaries from the first wave, and then they launch a fresh wave of sixteen new attempts. Brand new agent, brand new environment. Before this new agent takes a single action, it reads those four summaries. So the new attempts inherit context: here's what four previous attempts thought about this bug, what they tried, where they got stuck, what they ruled out.

6:22Hope: It's the relay race image. First wave of runners takes the course. Before the second wave starts, the four most informative runners write down what they learned: where the course gets tricky, what shortcuts didn't work, where they wasted time. The second wave reads those notes before the gun goes off. They call this Parallel-Distill-Refine, or PDR. Then those second-wave attempts go through their own tournament, and you get one final winner.

6:50Finn: So the full pipeline, end to end: sixteen attempts, tournament down to four, sixteen new attempts that have read those four summaries, tournament down to one. Thirty-two rollouts per problem, plus all the judging.
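
The end-to-end flow Finn lists can be written as a short orchestration sketch. Everything here is hypothetical scaffolding: `run_agent`, `summarize`, and `select` are placeholder callables standing in for the agent, the summarizer, and the tournament, and the loop runs sequentially for simplicity where a real system would launch each wave in parallel.

```python
from typing import Callable, List


def parallel_distill_refine(
    task: str,
    run_agent: Callable[[str, List[str]], str],    # (task, prior summaries) -> rollout
    summarize: Callable[[str], str],               # rollout -> structured summary
    select: Callable[[List[str], int], List[int]], # (summaries, k) -> indices of the k kept
    width: int = 16,
    keep: int = 4,
) -> str:
    """Two-wave pipeline: sample, distill, select priors, refine, select a winner."""
    # Wave 1: independent attempts with no prior context.
    wave1 = [run_agent(task, []) for _ in range(width)]
    summaries1 = [summarize(r) for r in wave1]

    # Tournament down to `keep` summaries that seed the next wave.
    priors = [summaries1[i] for i in select(summaries1, keep)]

    # Wave 2: fresh attempts that read the priors before acting.
    wave2 = [run_agent(task, priors) for _ in range(width)]
    summaries2 = [summarize(r) for r in wave2]

    # Tournament down to one; return the winning rollout itself.
    return wave2[select(summaries2, 1)[0]]
```

Having `select` return indices rather than summaries keeps the mapping from a winning summary back to its rollout unambiguous even if two summaries happen to be identical.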

7:03Hope: And the headline numbers are real. On SWE-Bench Verified, which is the standard benchmark of actual bugs from open-source Python projects, Claude four-point-five Opus jumps from about seventy-one percent to about seventy-eight. On Terminal-Bench, the harder command-line benchmark with stuff like "recover this corrupted SQLite file," the same model goes from forty-seven to fifty-nine. Pro on Terminal-Bench: fifty-three to sixty-five. Claude four-point-five Sonnet on Terminal-Bench: forty-one to fifty-seven. These are gains of six to sixteen percentage points of accuracy, on top of already-strong models, with no training, just clever inference.

7:45Finn: Which is the kind of gap that usually comes from a model generation upgrade. It's a lot. But it's also expensive, and we should be honest about that. Thirty-two rollouts per problem, plus tournament voting, plus summarization calls. This is multi-x the compute of a single attempt. They're trading a lot of inference for those points.

8:05Hope: Yeah. The cost is real and they don't bury it. But there's a finding inside the paper that I think is the single most interesting thing in the whole experimental section — and it's the one that justifies the architecture, in my read. They take all the second-wave attempts, the ones that read four prior summaries before starting. They bucket those tasks by how many of the four priors actually solved the problem. Like — were the priors any good?

8:32Finn: And the result is just — almost too clean. When zero out of four priors got it right, the new attempts succeed essentially zero percent of the time. Maybe one or two percent. When four out of four priors got it right, the new attempts succeed almost a hundred percent of the time. Ninety-seven, ninety-eight, ninety-nine. The in-between buckets walk smoothly between those two extremes.

8:55Hope: The quality of the prior context essentially determines whether the next attempt succeeds. It's a near-deterministic relationship.
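
The bucketing analysis behind this finding is simple to reproduce on your own logs. The sketch below assumes each second-wave attempt is recorded as a pair of (number of priors that solved the task, whether this attempt passed); that record format and the function name are assumptions for illustration, and the toy records are made up rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_by_prior_quality(
    records: Iterable[Tuple[int, bool]],
) -> Dict[int, float]:
    """Group attempts by how many priors were correct; return pass rate per bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [passes, attempts]
    for n_correct, passed in records:
        totals[n_correct][0] += int(passed)
        totals[n_correct][1] += 1
    return {k: p / n for k, (p, n) in sorted(totals.items())}


# Made-up toy records, only to show the shape of the output.
toy = [(0, False), (0, False), (2, True), (2, False), (4, True), (4, True)]
print(success_by_prior_quality(toy))  # {0: 0.0, 2: 0.5, 4: 1.0}
```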

9:04Finn: Once you really sit with that, the entire architecture stops being clever and starts being kind of forced. Because what they've shown is that whatever you give the next wave of attempts to read, that's the ceiling. If you give them garbage, you get garbage. Give them a good seed, you get success. So now the whole question of refinement reduces to: how do you make sure the priors you pass forward are the good ones? That's exactly what the tournament is for.

9:32Hope: The tournament isn't decoration. It's the filtration step that determines whether refinement amplifies signal or amplifies noise. Without picking the priors, refinement is a coin flip on whether the second wave learns from successes or doubles down on the wrong diagnosis.

9:50Finn: There's one more empirical thing I want to flag, because I think it's underplayed in the paper. The step efficiency. The refined attempts — the ones in the second wave that read prior summaries — don't just succeed more often. They succeed faster. Way faster.

10:06Hope: How much faster?

10:07Finn: For Claude four-point-five Opus on SWE-Bench, the average attempt drops from forty-one steps to fourteen. Pro drops from thirty-six to eighteen. The agent isn't re-grepping the directory structure. It isn't rediscovering which file the bug lives in. It isn't doing the dependency-installation dance for the third time. The summary tells it where to go, and it goes there.

10:33Hope: There's a great concrete example in the appendix. A Django bug. The first wave of attempts kept hitting ModuleNotFoundError mid-run because of missing packages. The refined attempt opens by saying, in effect: based on analysis of four prior attempts, I'll preemptively install asgiref, pytz, and sqlparse. Then it goes directly to the buggy line and applies the fix in ten steps instead of forty.

11:00Finn: Which is the right way to think about what the summaries are actually doing. They're not evaluation criteria. They're inherited operational knowledge. The agent in the second wave is starting from a much more advanced point in the problem-solving process.

11:17Hope: And it makes the cost picture a little less ugly. Yes, you're running thirty-two rollouts. But the second sixteen are each shorter than a baseline rollout would have been. So the multi-x is real but not as bad as you'd think from "thirty-two attempts."
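
A back-of-the-envelope check of that point, using the step counts quoted earlier for Claude four-point-five Opus on SWE-Bench (forty-one steps for a first-wave attempt, fourteen for a refined one). The function name is made up for the example, and the tally counts agent steps only: summarization and judge calls are left out, and it assumes every step costs about the same, which real rollouts won't satisfy exactly.

```python
def agent_steps(width: int, first_wave_steps: int, refined_steps: int):
    """Return (wave-1 steps, wave-2 steps, total) for a two-wave pipeline."""
    wave1 = width * first_wave_steps
    wave2 = width * refined_steps
    return wave1, wave2, wave1 + wave2


wave1, wave2, total = agent_steps(16, 41, 14)
naive = 32 * 41  # thirty-two rollouts if none of them got shorter

print(wave1, wave2, total)      # 656 224 880
print(round(total / naive, 2))  # 0.67
print(round(total / 41, 1))     # 21.5
```

Under these assumptions the pipeline uses about two thirds of the agent steps that thirty-two full-length rollouts would, which is still roughly twenty-one times the steps of a single attempt.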

11:33Finn: OK. I want to push back on this, because I do think the paper is good, but there's a steelman case I want to put on the table. The biggest one is the judge.

11:43Hope: Yeah, Finn, this is the place I'd push too.

11:45Finn: The judge in RTV is the same model that produced the rollouts. They report judge accuracy in one of the tables, meaning how often the judge actually picks the better attempt when we know which one was better, and at the final rounds it's running between sixty and eighty percent. That's a lot of error, and it's correlated error, because the judge has the same blind spots as the generator. If the model is systematically wrong about a class of bug, it'll be just as wrong evaluating attempts on that bug as it was producing them.

12:19Hope: There's a footnote about Pro specifically having unusually low judge accuracy and correspondingly weaker final improvements, which is a hint that the whole pipeline is bottlenecked on judgment quality. A separately trained dedicated judge, the kind of thing you'd build with fine-tuning or reinforcement learning on actual rollout-quality labels, is the obvious next step. They explicitly say that. But it isn't tested.

12:47Finn: The second thing I'd flag, Hope, is the bimodal collapse. The paper celebrates that more tasks reach sixteen-out-of-sixteen passing after refinement, and that's true: for one of the models on SWE-Bench, the number of tasks where every single attempt succeeds jumps from two-oh-nine to three-fifty. But the number of tasks where every single attempt fails also goes up. From seventy-three to ninety-four.

13:12Hope: Right. Refinement isn't a strict improvement. It's a redistribution. For tasks where the initial attempts were mediocre, the tournament can pick the wrong four to pass forward, and now the second wave is anchored on bad priors. So they flail harder.

13:27Finn: Which connects directly to the context-quality finding. If the priors are garbage, the next wave is garbage, deterministically. The aggregate average rises, the variance gets worse, and for any specific hard task, you might be worse off than just running sixteen attempts and crossing your fingers.

13:47Hope: There's also the framing of "long-horizon coding." That phrase is doing a lot of work. They tested two benchmarks: SWE-Bench and Terminal-Bench. Both are English-language coding tasks, both have discrete pass-fail outcomes from automated test suites, both fit on a single machine. Whether this same recipe transfers to multi-day software engineering work, web agents, scientific research agents, things without crisp test outcomes, that's genuinely open.

14:19Finn: The summaries themselves are also a free variable. The whole pipeline rests on the assumption that the summary preserves the right information. But the summary is generated by the same model under a fixed prompt. They don't ablate summarization quality. If the model is bad at summarizing on some task family — and it's plausible it would be, since summarization is itself a skill that varies by domain — the substrate of the whole apparatus quietly degrades.

14:50Hope: All those critiques are real. None of them undo the central reframe, though, which is the part I keep coming back to. For two years, the field's mental model of test-time scaling has been "more compute equals more accuracy." Sample more, think longer, make the context window bigger. This paper is making a different claim. For long-horizon agents, raw compute hits a wall. Sixteen rollouts of forty thousand tokens each is a lot of compute. But if you can't aggregate them, if you can't compare them, can't summarize them, can't pass useful pieces of them forward, that compute is wasted. The bottleneck isn't generation. It's representation.

15:34Finn: And that's a different research agenda. If the bottleneck is representation, then the next frontier isn't longer contexts or smarter base models. It's better experience artifacts. What does an agent's lab notebook look like? Should it be textual summaries? Persistent code files? Derived test cases? Reusable tools the agent built for itself in an earlier attempt and now keeps around? The paper gestures at this in its conclusion. Today, prior experience flows through summaries. Tomorrow, it could flow through accumulating workspace state.

16:08Hope: Which makes this paper feel less like a technique and more like a marker. It's the place the field stops asking "how do we make a single rollout smarter" and starts asking "how do we make a sequence of attempts collectively smarter than any one of them." That's much closer to how human teams work than how individual humans think.

16:30Finn: And the deepest line in the paper, the one I'd actually quote — it's something like: you can't select what you can't compare, and you can't reuse what you can't summarize. That's the whole thesis in one sentence.

16:43Hope: It is. And once you've heard it, it's hard to unhear.

16:46Finn: The paper is from Meta Superintelligence Labs and collaborators across UW, NYU, Carnegie Mellon, Princeton, and Google, posted in April. We'll link it in the show notes alongside the related background reading.

17:05Hope: Thanks for listening to AI Papers: A Deep Dive.