Episode 097 · May 29, 2026 · 25 min

Same Tokens, Same Cost, Wildly Different Results: What Actually Scales in AI Agents

Zhang, Wang, Xu et al.

Test-time Scaling · Agent Systems · NLP
Paper: Scaling Laws for Agent Harnesses via Effective Feedback Compute
Venue: arXiv:2605.29682
Year: 2026
Read the paper: arxiv.org/abs/2605.29682

Two AI runs spend identical tokens, make identical tool calls, and cost the same down to the penny, yet one succeeds 27% of the time and the other 90%. A new paper argues the resource that actually scales agents isn't compute at all, but feedback that's validated, novel, and remembered. If they're right, the reflex to throw more budget at a struggling agent is often just buying more waste.

What you'll take away

  • Why counting tokens, tool calls, and cost measures activity, not progress, and on real traces actually predicts worse than guessing the average (negative R-squared)
  • Effective Feedback Compute: the four-factor score (informative, valid, non-redundant, retained) that's multiplied, not averaged, so missing any one factor zeroes out the whole event
  • The matched-budget experiment that makes the causal case: identical spend on every axis, quality varied alone, success jumps from 27% to 90%
  • Why there's no universally best harness: the fanciest scaffolding wins on code tasks but loses to simpler ones on software-engineering tasks
  • The honest limitations: author-constructed feedback conditions, a curated slice of real benchmarks, and fitted task-demand parameters, plus the prospective holdout that defends against curve-fitting
  • The forward-looking payoff: because the metric can be estimated mid-run from the trace, you could cut off runs that are spinning and pour budget into the ones genuinely learning

Chapters

  1. 00:00 The 27-versus-90 puzzle
  2. 02:32 Why training scaling laws don't transfer to agents
  3. 05:04 Activity is not progress
  4. 07:36 Effective Feedback Compute and the four-factor product
  5. 10:08 Task demand: feedback relative to thirst
  6. 12:40 From a cloud of dots to a clean curve
  7. 15:12 The matched-budget causal test
  8. 17:45 Surviving contact with reality
  9. 21:41 No universally best harness
  10. 22:49 Practical upshot and the adaptive-budget dream



0:00Bella: Two runs of the same AI agent. Same task. Same model under the hood. They spend the identical number of tokens, make the identical number of tool calls, and rack up, down to the penny, the identical cost. By every meter we normally use to measure how hard an AI is working, these two runs are twins. One of them succeeds twenty-seven percent of the time. The other succeeds ninety percent of the time.

0:27Tyler: And the natural question is — if they spent exactly the same, what on earth is the difference?

0:34Bella: That gap is the heart of a paper that went up on arXiv on May twenty-eighth, twenty-twenty-six, and we are recording the very next day, on the twenty-ninth. Quick ground rules before we get into it. This episode is AI-generated. The script was written by Anthropic's 4.8. I'm Bella, and the other voice you're hearing is Tyler. We are both AI voices from Eleven Labs, and the producer isn't affiliated with either Anthropic or Eleven Labs. The paper is called "Scaling Laws for Agent Harnesses via Effective Feedback Compute," out of Harbin Institute of Technology. And that twenty-seven-versus-ninety gap is the experiment that, I think, earns the whole thing.

1:19Tyler: It does. But let's not jump to the punchline yet, because the setup is half the insight. Bella, you want to lay out why this problem even exists?

1:28Bella: Sure. So start with something the field is actually good at: scaling laws for training. For years, the reliable recipe for a better model was: make it bigger, feed it more data, spend more compute. And the beautiful thing is that it's predictable. You can draw a curve. Spend more, and performance climbs along a smooth line you can forecast in advance. There's a clean knob on the x-axis, and the curve tells you what you buy when you turn it.

1:58Tyler: Which is genuinely remarkable, by the way. You don't usually get to predict the future in machine learning. Scaling laws let you do exactly that for training.

2:09Bella: Right. Now move to agents. And when this paper says "agent," it doesn't mean a model answering one question. It means a model wrapped in scaffolding that runs in a loop: make a plan, take an action, call a tool or run some code, look at what the environment says back, update, and try again. That scaffolding is what they call the harness. The routing, the verification checks, the memory, the retry logic.

2:37Tyler: And the harness, not just the base model, is what's really driving the outcome here. Same model, two different harnesses, completely different behavior.

2:48Bella: Exactly. So the question becomes: is there a scaling law for the harness? Is there a knob you can put on the x-axis the way model size is a knob for training? And the obvious candidate is just: how much is the agent doing? Count the tokens. Count the tool calls. Count the dollars. Spend more, get more. That's the intuition everyone reaches for.

3:14Tyler: And here's where it falls apart, and I think this is the most important thing to feel before any of the math. Counting measures activity. It does not measure progress. And in a harness, those two things come apart violently. Think about what a churning agent actually looks like. It runs a command, the command fails. It runs the same command again, and it fails again. It gets a noisy reading from some tool, doesn't write it down anywhere, forgets it by the next step. It's busy. It's burning tokens at a furious rate. And it is going absolutely nowhere.

3:54Bella: And a different run can spend the exact same tokens gathering information that genuinely changes what it does next.

4:02Tyler: That's the killer. Both runs look identical on the meter. One is learning, one is spinning its wheels, and your counter cannot tell them apart. So as a predictor, it's doomed from the start. You're measuring the wrong thing.

4:18Bella: So the paper's move, and this is the core insight, is to stop measuring how much the agent spends, and start measuring how much of that spending it converts into feedback that actually mattered. They call it Effective Feedback Compute. The whole pitch is right there in one of their sentences: scaling is governed less by how much computation is spent than by how efficiently raw budget gets converted into durable, task-sufficient feedback.

4:48Tyler: And "feedback that mattered" isn't hand-wavy here. They give it a precise structure, which is where it gets interesting.

4:56Bella: Right. So every feedback event, meaning every time the agent gets something back from its environment, gets scored on four things. One: is it informative? Does it actually reveal something relevant to the task? Two: is it valid? Is it backed by reliable evidence, like a test that passed or a deterministic checker, rather than a guess? Three: is it non-redundant? Is this something new, or just a rerun of what the agent already knew? And four: is it retained? Did the agent actually fold it into its plan or memory going forward, or did it evaporate?

5:33Tyler: And the crucial design choice — the thing that makes this more than a checklist — is that those four numbers get multiplied together. Not averaged. Multiplied.

5:43Bella: Yes. And that distinction is everything. Picture a chain with four links. The chain is only as strong as its weakest link. If one link snaps, it carries no load, and it doesn't matter how strong the other three are. That's what multiplication does. A brilliant, novel insight that the agent immediately forgets? One snapped link. The whole event scores near zero. A tool call that runs a real, valid test but tells you nothing you didn't already know? Snapped link. Near zero. You don't get partial credit for being great at three out of four.

6:20Tyler: And contrast that with averaging, which is what you'd reach for if you weren't thinking carefully. An average says, hey, three out of four, that's a solid seventy-five percent, nice work. The product says: no. Useful feedback has to be all four things at once. Informative and valid and novel and remembered. It's an AND-gate, not a scorecard.

6:43Bella: And the claim baked into that choice is substantive — it's a claim about how feedback actually works. These properties are conjunctive. Miss any one and the feedback is worthless, no matter how good the rest looks.
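
The product-versus-average contrast can be made concrete with a minimal Python sketch. It assumes each of the four factors is scored between 0 and 1, which the episode does not state, and the function names are illustrative rather than taken from the paper.

```python
from math import prod
from statistics import mean


def event_score_product(informative, valid, non_redundant, retained):
    """Multiply the four factors: a factor of zero zeroes the whole event."""
    return prod([informative, valid, non_redundant, retained])


def event_score_average(informative, valid, non_redundant, retained):
    """Plain average, shown only for contrast: it hands out partial credit."""
    return mean([informative, valid, non_redundant, retained])


# A brilliant, valid, novel insight that the agent immediately forgets.
forgotten = dict(informative=1.0, valid=1.0, non_redundant=1.0, retained=0.0)
# Feedback that is solid on all four factors at once.
solid = dict(informative=0.9, valid=0.9, non_redundant=0.9, retained=0.9)

print(event_score_product(**forgotten))        # 0.0
print(event_score_average(**forgotten))        # 0.75
print(round(event_score_product(**solid), 4))  # 0.6561
```

The forgotten insight scores 0.75 under averaging but 0.0 under the product, which is the AND-gate behavior described in the conversation.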

6:56Tyler: Okay, but I want to push on something, because so far this is one half of the picture. You've got a score for how good the feedback is. But "good enough" depends on the job, right? A trivial lookup and a brutal multi-step debugging task don't need the same amount of feedback to crack.

7:16Bella: That's exactly the second half, and it's why a raw feedback score isn't enough on its own. They divide it by something they call task demand — basically, how feedback-hungry is this particular task? The analogy I'd reach for is watering plants. A cactus needs barely any water. A fern in a hot room is thirsty all the time. Pour the same cup into both and you drown one and starve the other. What matters isn't the absolute amount of water — it's water relative to thirst.

7:46Tyler: So a task with strong built-in checkers, lots of reliable verification available — that's the cactus. It needs very little effective feedback to solve, because the environment is already telling you when you're right.

8:00Bella: Right. And a long task with ambiguous tools, tons of internal state to track, noisy observations — that's the parched fern. It demands a lot of effective feedback before you'll ever solve it. So you take your feedback score and divide by task demand, and now you've got sufficiency — did we gather enough for this task — rather than just raw quantity. And critically, that's what lets you put a simple lookup and a hard debugging job on the same axis and actually compare them.
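
The "feedback relative to thirst" idea can be sketched as a simple ratio. This is an illustration under stated assumptions: per-event scores are summed over the trace (the episode does not say how events are aggregated), `task_demand` is a single positive number, and all values below are invented.

```python
def effective_feedback(events):
    """Sum the four-factor product over every feedback event in a trace.

    Each event is a tuple (informative, valid, non_redundant, retained).
    Summation is an assumed aggregation rule, not the paper's definition.
    """
    return sum(i * v * n * r for (i, v, n, r) in events)


def sufficiency(events, task_demand):
    """Effective feedback divided by how feedback-hungry the task is."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback(events) / task_demand


# One trace with three feedback events; the middle one is fully redundant.
trace = [(0.8, 1.0, 1.0, 0.5), (0.5, 1.0, 0.0, 1.0), (1.0, 0.5, 1.0, 1.0)]

cactus = sufficiency(trace, task_demand=0.5)  # low-demand task
fern = sufficiency(trace, task_demand=3.0)    # high-demand task
print(cactus > fern)  # True: same feedback, very different sufficiency
```

The same trace is more than sufficient for the low-demand task and well short for the high-demand one, which is what lets different tasks share one axis.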

8:32Tyler: Which matters enormously for the evidence, because the whole paper is a horse race of curves, and you can't race tasks against each other if they're all sitting at different baselines.

8:45Bella: So let's talk about that evidence, because the first big experiment is a beautiful, clean staircase. They build a controlled benchmark: procedurally generated tasks with hidden state and deterministic right answers, so they can measure everything exactly. And they build a whole spectrum of harnesses, from a dumb single-pass "just answer the question" baseline all the way up to full harnesses with routing and verification and structured memory.

9:15Tyler: And there's one in that lineup I want to flag, because it's a planted trap. They build one they call "High Budget Noisy." It deliberately spends a ton of raw budget — but with weak routing, weak verification, weak memory. It's the spendthrift that buys nothing.

9:33Bella: The negative control.

9:35Tyler: Exactly. If raw spending were what mattered, this thing should look fantastic, since it spends more than almost anyone. And it stays stuck at low success. That's the whole thesis in a single harness: throwing budget at a leaky harness just buys you more waste.

9:52Bella: So here's the staircase. They take a standard scaling-law curve, where failure rate drops as you increase whatever's on your x-axis, and they ask, for each candidate x-axis, how tightly does the data hug that curve? And the way you score that is a number called R-squared. Quick intuition for anyone who needs it: R-squared near one means your data points sit right on a clean curve, your x-axis fully governs the outcome. R-squared near zero means the points are a shapeless cloud and your x-axis tells you nothing.

10:27Tyler: And, worth planting now because it pays off later: R-squared can go negative. Below zero means your predictor is literally worse than just guessing the average every single time. Hold that thought.
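
Why R-squared can drop below zero is easy to see from its standard definition, one minus the residual sum of squares over the total sum of squares. The sketch below uses that textbook formula; the data values are invented for illustration and do not come from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


observed = [0.1, 0.4, 0.6, 0.9]        # made-up outcomes

on_the_curve = list(observed)          # predictions that match exactly
guess_the_mean = [0.5] * 4             # always predict the average
backwards = [0.9, 0.6, 0.4, 0.1]       # predictions that point the wrong way

print(r_squared(observed, on_the_curve))    # 1.0
print(r_squared(observed, guess_the_mean))  # approximately 0
print(r_squared(observed, backwards))       # negative: worse than the mean
```

A predictor scores below zero whenever its squared error exceeds that of always guessing the average.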

10:41Bella: So picture the same scatter of dots, and we keep swapping the x-axis. Raw tokens on the x-axis? The dots are a cloud. R-squared around point three-three, which explains about a third of what's going on. Wall-clock time, raw cost: basically the same cloud, high thirties. Tool calls and operations nudge you up to around point four-two. So all the activity-counting measures land in a smear down at the bottom.

11:08Tyler: Which is already damning, honestly. The thing everyone reaches for explains a third of the variation. That's a cloud.

11:16Bella: Then they throw in a strong baseline: a concurrent, more elaborate approach to scaling agent systems that uses many features at once. Call it the heavyweight multi-feature model. That jumps you to point eight-eight. Real improvement. And then their single trace-level number, Effective Feedback Compute, even in the estimated version that never peeks at hidden state, hits point nine-four. One interpretable scalar beats the heavyweight. And once you normalize by task demand, the oracle version reaches point nine-nine. So the same cloud of dots that wouldn't fit any curve under raw tokens collapses onto an arc you could trace with one stroke of a pencil. That's the picture. Cloud to clean arc.

12:05Tyler: And that's lovely, but I want to be the annoying person in the room for a second, because a correlation like that has an obvious escape hatch.

12:14Bella: Go for it, Tyler.

12:15Tyler: The escape hatch is: of course the high-feedback runs succeed more — they probably just spent more! Maybe effective feedback is just a fancy proxy for "this run did more stuff." Correlation between two things that both rise with effort tells you nothing about which one's in the driver's seat. And this is where the paper does the move that turns it from a nice metric into an actual causal claim. The matched-budget experiment — the one we opened with.

12:45Bella: This is the centerpiece. Walk through how they built it.

12:48Tyler: So they construct pairs of runs on the same task, same model, where the budgets are forced to be identical. Same token count. Same number of tool calls. Same cost. They report the mean differences in budget as literally zero: if you plotted the two conditions against each other on spending, every point sits exactly on the diagonal. Spending is held fixed by construction. The only thing they vary is the quality of the feedback the agent gets back. One condition gets noisy, redundant, poorly-retained observations. The other gets targeted, valid, non-redundant feedback that sticks.

13:26Bella: Same money. Different shopping.

13:28Tyler: That's exactly the image. Two people sent to a hardware store with identical budgets — same dollars, same number of items they're allowed to grab. One comes back with precisely the parts to fix the leak. The other comes back with a cart of random, half-useful junk and a thing they already owned at home. Same spend. Wildly different outcome. And the numbers: success goes from twenty-seven percent to ninety percent. With the statistical confidence essentially at certainty. Same budget on every axis — quality alone moves the outcome by sixty-three points.

14:03Bella: And that's the moment the four-factor product earns its keep. Because in the high-quality condition, all four factors rise together — informative, valid, novel, retained, all at once. Which is precisely what a product rewards and an average would have smeared away.

14:22Tyler: It's the cleanest causal statement in the paper. Budget fixed, quality varied, outcome explodes. You can't explain that away with "they just spent more," because they didn't.

14:33Bella: Now, I want to give the skeptic their due here, because there's a real soft spot in even this experiment, and the authors are sort of walking up to it.

14:43Tyler: You mean that the two conditions are constructed by the authors.

14:48Bella: Right. The "high quality" and "low quality" feedback streams aren't found in the wild — the researchers built them to differ along exactly the dimensions their metric cares about. So strictly speaking, the experiment proves that their quality knob, the one they designed, moves outcomes. Which is a slightly weaker statement than "we discovered the quality knob that governs in nature."

15:14Tyler: It's a fair hit. Though I'd say it's the difference between "we proved our knob is real and causal" and "we proved our knob is the only knob." The first is still a strong result. But yeah, you should hold a little reservation there.

15:29Bella: And it connects to the deepest critique of the whole paper, which we should put on the table plainly. The worry is circularity. Effective Feedback Compute is built out of properties (informative, valid, non-redundant, retained) that are almost by definition correlated with making progress. So a hardened skeptic asks: did you discover a law of nature, or did you build a very well-engineered relabeling of "the run that did useful things is the run that succeeded"?

16:00Tyler: And that's the right question to be nervous about. If you define your x-axis using ingredients that smell like success, getting a high R-squared isn't a miracle. It's a little bit baked in.

16:14Bella: So how does the paper answer that? This is where the generalization gauntlet comes in, and Tyler, this is really your stretch — the part where the idea has to survive contact with reality.

16:27Tyler: Right, and it survives in stages, each one closing off an excuse. First excuse: "this only works because you have oracle access to hidden state, so it's a lab toy." Their answer is to build an estimated version that never sees the hidden state and never sees the final success label. It reconstructs the feedback score purely from things you can read off the trace itself: did a checker fire? Did a tool result get referenced again later? Did the agent's plan actually change? And that estimated version recovers almost all of the oracle signal, including on executable code tasks.
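
A trace-only estimate of this kind might look like the sketch below. Everything in it is hypothetical: the `TraceEvent` fields, the mapping from each observable signal to one of the four factors, and the hard 0-or-1 scoring are assumptions made for illustration, not the paper's estimator. It reads only the trace and never the hidden state or the success label.

```python
from dataclasses import dataclass


@dataclass
class TraceEvent:
    """One feedback event as it appears in a trace (hypothetical schema)."""
    observation: str
    checker_fired: bool      # did a deterministic check back this result?
    referenced_later: bool   # was the result used again in a later step?
    plan_changed: bool       # did the agent's plan change afterwards?


def estimate_effective_feedback(trace):
    """Sum a crude four-factor product over the trace (assumed mapping)."""
    seen = set()
    total = 0.0
    for ev in trace:
        informative = 1.0 if ev.plan_changed else 0.0
        valid = 1.0 if ev.checker_fired else 0.0
        non_redundant = 0.0 if ev.observation in seen else 1.0
        retained = 1.0 if ev.referenced_later else 0.0
        seen.add(ev.observation)
        total += informative * valid * non_redundant * retained
    return total


# Same number of events in both traces, so the same "activity".
churning = [TraceEvent("exit code 1", False, False, False)] * 3
learning = [
    TraceEvent("test_a failed: KeyError", True, True, True),
    TraceEvent("test_a passed", True, True, True),
    TraceEvent("test_b passed", True, True, True),
]

print(estimate_effective_feedback(churning))  # 0.0
print(estimate_effective_feedback(learning))  # 3.0
```

Both traces contain three events, yet the estimate separates the run that repeats a failing command from the run whose checked results feed back into its plan.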

17:08Bella: So it's not an artifact of privileged access. You can compute it from the trace.

17:13Tyler: Right. And then the second excuse: "your synthetic benchmark is too clean." So they run it on real mixed benchmark traces: code tasks, terminal tasks, software-engineering tasks. And here's the most dramatic number in the entire paper. On those real traces, all the activity-counting measures (raw tokens, wall time, cost, tool calls, operations) don't just get worse. They go negative. R-squared lands somewhere around minus point-zero-eight to minus point-zero-two.

17:46Bella: Negative. Which means —

17:48Tyler: Which means, remember what I flagged earlier, counting tokens on real traces is worse than useless. You'd make better predictions by ignoring the token count entirely and just guessing the average every time. Meanwhile the hardened, real-trace version of their metric, the one that aggressively discounts repeated failures and unstable observations, hits point nine-two.

18:15Bella: That gap is the whole argument in one frame. The standard measure goes below zero; the new one stays up near the top.

18:23Tyler: And then the move I respect the most, because it's the strongest defense against that circularity worry. The prospective holdout.

18:32Bella: Tell people what that means, because it's doing real work.

18:36Tyler: So the obvious comeback to any impressive in-sample fit is: "sure, you tuned your metric to fit the data you already had." So they froze everything. The entire metric definition, the task-demand factors, the fitted exponents, the baselines — all locked down in advance. Then they collected and scored a brand-new batch of traces. No peeking, no retuning. And it held at point eight-five. Lower than the in-sample numbers, which is honest and expected, but a long way from collapsing. That's the result a careful reviewer should lean on, far more than any of the in-sample fits. Pre-registering the whole thing before you see the data is the cleanest way to show you didn't just curve-fit your way to a pretty number.
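
The freeze-then-score procedure can be sketched in a few lines. Two assumptions are made loudly here: the curve is taken to be a power law, failure = a * x^(-b), which the episode never specifies, and every number is invented. The point is only the workflow: fit once, lock the parameters, then score data collected afterwards without refitting.

```python
import math


def fit_power_law(xs, failures):
    """Least-squares fit of log(failure) = log(a) - b * log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(f) for f in failures]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope  # (a, b)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Step 1: fit on the data already in hand (made-up values).
a, b = fit_power_law([1, 4, 16], [0.8, 0.4, 0.2])
FROZEN = (a, b)  # locked before any holdout trace exists

# Step 2: score a later batch with the frozen parameters, no retuning.
holdout_x = [9, 25, 100]
holdout_failure = [0.25, 0.18, 0.07]  # made-up observations
predicted = [FROZEN[0] * x ** (-FROZEN[1]) for x in holdout_x]

holdout_r2 = r_squared(holdout_failure, predicted)
print(round(FROZEN[0], 3), round(FROZEN[1], 3))  # 0.8 0.5
print(holdout_r2 < 1.0)  # True: out-of-sample fit is imperfect
```

Because nothing is refit in step 2, a high holdout score cannot come from parameters chasing the new data.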

19:24Bella: I want to be fair about the remaining soft spots, though, because they're real. The whole evaluation is heavily self-constructed: the synthetic tasks, the harness families, the task-demand factors, even their reimplementation of that heavyweight baseline. It's all built by the same team under the same conceptual frame. And the real-benchmark layer is explicitly filtered down to tasks that have automatic checkers and reproducible sandboxes.

19:54Tyler: Which they're upfront about — they say it's chosen for external validity, not leaderboard comparison. But it does mean the headline real-traces win is on a curated, friendly slice of reality.

20:07Bella: And there's one more honest wrinkle. When they move to heterogeneous mixed tasks, their hand-designed task-demand formula, the watering-can-relative-to-thirst term, actually underperforms. They have to fit its parameters to the data. And fitting free parameters is exactly the kind of flexibility that can inflate a result.

20:29Tyler: To their credit, they say plainly that those fitted parameters are calibration knobs, not universal causal truths. And the prospective holdout is the answer to that worry too, because the fitted version was frozen before the holdout traces existed. So it's not free parameters chasing the test set.

20:48Bella: Now I want to get to the result that I think is the most fun, because it complicates the tidy story in a way I didn't expect. When they decompose efficiency, meaning how good a given harness is at converting budget into useful feedback, you'd assume the fanciest harness wins everywhere. The deep one with all the bells and whistles. And on code tasks, that's true: the deep harnesses dominate, efficiency way up around one-point-nine.

21:17Tyler: And on the other environments?

21:20Bella: On terminal tasks, every harness is stuck at low efficiency, around point-one across the board. The environment is just intrinsically stingy with clean feedback; nobody can extract good signal from it. And then on software-engineering tasks, the ranking flips. The earlier, mid-stage harnesses, including the simpler ones, come out on top.

21:42Tyler: Huh. So there's no universally best harness.

21:46Bella: There isn't. And the framing they use is that efficiency should be understood as a harness-task interaction, not a fixed property of the harness alone. The clean image is tools. A power drill is the best thing in the workshop when you've got solid wood. It's useless when the material crumbles. And on a third job, a plain screwdriver wins. The "best tool" isn't a property of the tool. It's a property of the tool meeting the material.

22:15Tyler: And that's a genuinely useful corrective, because there's this persistent fantasy in agent-land that someone will discover the one true architecture and we'll all just use that. This says: no, the right harness depends on whether your environment actually yields clean, reusable feedback. Which, honestly, matches what anyone building these things has felt in their gut.

22:39Bella: Let me pull together the practical upshot, because I think it's sharper than people expect. The expensive instinct, when an agent underperforms, is to give it more. More tokens, more tool calls, a bigger budget. And this paper says that instinct is frequently just wrong, and it gives you the reason. Budget only helps to the degree your harness converts it into feedback that's valid, novel, and remembered. If your scaffolding is leaky (noisy observations, no memory, redundant retries), more budget mostly buys more waste. That's the High Budget Noisy harness, and it's the matched-budget result.

23:18Tyler: And there's a forward-looking piece that I think is the most exciting promise here, even though it's explicitly future work. Because the estimated version can be computed from the trace itself, before you know the outcome, you could in principle watch an agent mid-run and tell whether it's accumulating real progress or just churning.

23:40Bella: Which is a foundation for adaptive budgets. Instead of handing every run a fixed allowance, you could cut a run off when it's clearly going nowhere and pour more into the ones that are actually learning.

23:53Tyler: Right — "stop spending, this one's dead." That's the dream version. The authors are careful to flag it as promise, not a shipped product. But you can see the shape of it.
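
Since the adaptive-budget idea is flagged as future work, the following is purely a thought experiment and not the authors' method. It assumes a per-step estimated feedback score is available mid-run, and the `window` and `min_gain` parameters are hypothetical knobs chosen for illustration.

```python
def should_stop(step_scores, window=5, min_gain=0.05):
    """Return True when the last `window` steps added almost no feedback.

    step_scores: estimated effective-feedback gain for each step so far.
    A run is never cut off before it has taken `window` steps.
    """
    if len(step_scores) < window:
        return False
    return sum(step_scores[-window:]) < min_gain


spinning = [0.4, 0.0, 0.0, 0.0, 0.0, 0.0]  # early gain, then nothing
learning = [0.1, 0.0, 0.2, 0.0, 0.3, 0.1]  # keeps accumulating

print(should_stop(spinning))  # True: cut this run off
print(should_stop(learning))  # False: keep funding it
```

A rule like this trades a little wasted spend, at most one window of steps, for the ability to reallocate budget away from runs that have stalled.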

24:04Bella: And one caution on the numbers for anyone wanting to check this against their own intuition — the base models and benchmarks the paper names are forward-dated. So the absolute figures are internally consistent, but you can't yet anchor them against a familiar reference point. The story to take away isn't the specific exponents. It's the reframing.

24:27Tyler: And that reframing is the thing I'll actually remember. For agent systems, the resource that scales isn't compute. It's information: extracted, validated, and retained. Cost is just the price of admission. What you do with it is the whole game.

24:43Bella: Same money, different shopping. Twenty-seven versus ninety.

24:48Tyler: That's the episode in one line.

24:50Bella: The paper is "Scaling Laws for Agent Harnesses via Effective Feedback Compute," from the team at Harbin Institute of Technology. The link's in the show notes, along with a few related reads if you want to go deeper on inference-time scaling.

25:05Tyler: And if you want the full transcript with every bit of jargon tappable for a definition — plus the concept pages that connect this to the other scaling-law episodes we've done — that all lives on paperdive.ai.

25:19Bella: Thanks for spending it with us. This has been AI Papers: A Deep Dive.