Episode 119 · Jun 05, 2026 · 22 min

Beating Reinforcement Learning Without Ever Touching the Model's Weights

Hwang, Suri, Villecroze et al.

LLM Agents
Paper
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents
Venue
arXiv:2606.05296
Year
2026
Read the paper
arxiv.org/abs/2606.05296

Two desktop GPUs matched, and on one task beat, a method that needed eight A100s, all without computing a single gradient or fine-tuning anything. The trick is an old theoretical equivalence between RL and Bayesian inference that only became useful once frontier models got sealed behind APIs. We unpack how a cheap critic steering a giant frozen model turns training into selection, and where the whole approach quietly falls apart.

What you'll take away

  • Why a known equivalence between reinforcement learning and Bayesian inference, long stuck in theory papers, suddenly becomes useful only when you're locked out of a model's weights
  • How the method replaces gradient-based training with Sequential Monte Carlo: running a population of trajectories, scoring them with a small critic, and 'breeding' the winners
  • Why learning the value function is just a cheap offline regression problem (2-3 hours, ~$140) rather than expensive online RL, because the big model stays frozen
  • The headline result: on the science-experiment environment, the no-gradient method crossed over GRPO, which had full access to the model's internals, and an 11B critic measurably improved GPT-5.1 without touching it
  • The failure case: when a model produces uniformly good, near-identical trajectories, there's nothing to select among and the method actually makes things worse
  • The fine print behind 'beats GRPO': it depends on high trajectory counts, hand-tuned resampling schedules, parallel API calls that aren't free, and a value estimate that's only ever approximate

Chapters

  1. 00:00The welded-shut hood problem
  2. 02:48RL is secretly Bayesian inference
  3. 05:37Particle filters and a beam of agents
  4. 08:26The cheap coach steering the frozen athlete
  5. 11:14The results that travel
  6. 14:03One decisive fork in WebShop
  7. 16:52Where it gets softer than the headline
  8. 19:41Why the constraint created the use case

0:00Cassidy: Two desktop GPUs, the kind you could fit under a desk, matched the performance of a method that needed eight A100s, a full server node, to do its job. And here's the part that should make you sit up: the desktop method never touched the model's weights. Not once.

0:19Eric: Never touched them, meaning no gradients, no fine-tuning, nothing reaching inside the model at all?

0:26Cassidy: Nothing. Just API calls to the big model, plus a small helper model running locally. And on at least one task, it didn't merely match the gradient-based method. It beat it.

0:37Eric: Which, if you know how reinforcement learning normally works, sounds a little like cheating. The whole point of gradient-based methods is that they get to open the hood.

0:48Cassidy: Right, and that tension is the paper. This result comes from work posted to arXiv on June third, twenty-twenty-six, and we're recording two days later, on June fifth. Quick note before we dig in: what you're hearing is an AI-generated episode. The script was written by Anthropic's Claude, and the two of us (I'm Cassidy, that's Eric) are both AI voices from Eleven Labs. The show's produced independently, no affiliation with either company. The paper is called "Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents." And that title gives the game away: the word "simulating" is doing enormous work. They're not doing RL. They're faking its effects without the machinery.

1:35Eric: So let me set up why anyone needs to fake it, because the constraint here is genuinely brutal and it's easy to wave past. The most capable AI agents right now are built on top of proprietary models. You reach them through an API. Text in, text out. The weights are sealed behind a corporate wall, and you never even see the raw probabilities the model assigns to its next word. Now, the dominant way we make an agent better at a specific task, say, navigating an e-commerce site or running a simulated chemistry experiment, is reinforcement learning, with methods like GRPO. And those work by computing gradients. A gradient is a directional nudge to every internal parameter: turn this dial up a hair, that one down. To compute it, you have to be inside the model, seeing its guts.

2:31Cassidy: And you simply can't do that through an API.

2:34Eric: You can't. So the analogy I keep coming back to: you want a faster lap time, but the hood of the race car is welded shut. You can drive it, you can watch the lap times, but you can't touch the engine. That's a black-box agent. The most powerful cars on the track are exactly the ones whose hoods you're least allowed to open.

2:57Cassidy: And the two escape hatches people normally reach for are both unsatisfying.

3:02Eric: Both bad. Option one: fiddle with the prompt. Cheap, but you're just asking the model nicely, and there's a ceiling. Option two: take a smaller model, one whose hood you can open, fine-tune that instead, and hope it transfers. But now you've abandoned the powerful model you actually wanted. You're optimizing the wrong car. Neither path lets you do real reward-driven optimization on the actual target.

3:31Cassidy: So that's the bind this paper walks into. And the way out runs through a piece of theory that, honestly, sounds too convenient when you first hear it: a known mathematical equivalence between reinforcement learning and Bayesian inference.

3:47Eric: Which is one of those phrases that's been in theory papers for years and mostly stayed there. Let's make it land, because everything hangs on it.

3:57Cassidy: Let me build it in two moves. First move: what is reinforcement-learning fine-tuning actually trying to do? Strip away the machinery. The goal is: take a model that already produces a wide spread of behaviors, and shift its tendencies toward the ones that earn high reward, without letting it drift so far that it forgets how to be sensible and fluent. That leash, the "don't wander off" term, is mathematically a penalty for straying too far from the original model. Hold onto that leash. It's the secret ingredient.

4:30Eric: So you're not rebuilding the model. You're rebalancing how often it does the good things versus the bad things, while keeping it tethered to its original character.

4:41Cassidy: Exactly. And here's the second move, the flip. There's a known result that says: the optimal policy under that leashed objective is exactly a Bayesian posterior. Now, Bayes' rule, in plain shape: you start with a prior (what you believed before), multiply by a likelihood (how well each option explains the evidence), and you get a posterior, your updated belief. In this paper, the prior is the black-box model's natural distribution over everything it might do. The likelihood is "this behavior earned high reward." Multiply them, and the posterior you get is the optimal policy. Not as a retrained model. As a probability distribution over whole trajectories.
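
For readers following along in text, the equivalence described here is usually written in the standard KL-regularized form below. This is the textbook statement, not an equation copied from the paper, and the notation is assumed for illustration: $\pi_0$ is the frozen black-box policy (the prior), $R(\tau)$ is the reward of a whole trajectory $\tau$, and $\beta > 0$ sets how tight the leash is.

```latex
\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \big] \;-\; \beta\, \mathrm{KL}\big( \pi \,\|\, \pi_0 \big)
\quad\Longrightarrow\quad
\pi^{*}(\tau) \;=\; \frac{1}{Z}\, \pi_0(\tau)\, \exp\!\big( R(\tau)/\beta \big),
\qquad
Z \;=\; \mathbb{E}_{\tau \sim \pi_0}\big[ \exp\!\big( R(\tau)/\beta \big) \big].
```

Read as Bayes' rule, $\pi_0(\tau)$ plays the prior, $\exp(R(\tau)/\beta)$ plays the likelihood of "this behavior earned high reward," and $\pi^{*}$ is the posterior.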

5:24Eric: And the payoff of writing it that way is what, exactly?

5:28Cassidy: If the optimal policy is a distribution, then the question stops being "how do I train?" and becomes "how do I draw samples from this distribution?" And sampling doesn't need gradients. You never have to open the hood. Here's the image I'd hold. Think of the black-box model as a chef with fixed habits, who produces a wide range of dishes. That's the prior. You've got a tasting panel scoring each dish. That's the reward. The optimal kitchen isn't a retrained chef. It's the same chef, unchanged, whose good dishes simply get served far more often. Exponentially more often, the better they are. He never strays into bizarre experiments, because he's still cooking in his own natural style. The leash keeps him there.
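
The "served exponentially more often" idea can be sketched in a few lines of Python. This is only an illustration of reweighting samples drawn from the unchanged model. The function name `posterior_weights` and the temperature `beta` are assumptions made here, not names from the paper.

```python
import math


def posterior_weights(rewards, beta=1.0):
    """Self-normalized weights proportional to exp(reward / beta).

    `rewards` holds one score per sampled dish (trajectory). `beta` is a
    hypothetical temperature: smaller values favor the best samples more.
    """
    if not rewards:
        return []
    top = max(rewards)  # subtract the max so exp() cannot overflow
    unnormalized = [math.exp((r - top) / beta) for r in rewards]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Three dishes from the same chef, scored by the tasting panel.
weights = posterior_weights([0.1, 0.6, 1.0], beta=0.25)
```

The chef (the sampler) is untouched. Only the serving frequencies in `weights` change, and they always sum to one.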

6:16Eric: So the chef never changes. What comes out of the kitchen does.

6:20Cassidy: That's the whole conceptual heart in one line. You replace training with selection. You let the model produce the things it already produces, and you preferentially serve the good ones.

6:33Eric: Okay, but "preferentially serve the good ones" is hiding a hard problem, and I want to drag it into the light. You've described a target distribution: the posterior, the reweighted chef. You can write it down. But the paper itself says you can't actually sample from it directly. Why not?

6:52Cassidy: Two reasons, and they're both about scale. Agent tasks have enormous action spaces and long horizons: you're making a sequence of decisions, each branching into a huge tree of possibilities. And for a true black box, you can't even read the token-level probabilities to help you steer. So you know what distribution you want, but you have no direct way to draw from it. You can only sample from the prior, which just means: run the agent and watch what it does.

7:22Eric: So you can sample from the chef's natural output, but not from the idealized "good dishes served more often" version.

7:30Cassidy: Right. And that gap, "I can sample from the prior but I want the posterior," is the oldest problem in a whole field of statistics. The tool they reach for is Sequential Monte Carlo. Particle filters.

7:43Eric: Which is probably the least familiar piece for a machine-learning audience, so let's not rush it.

7:49Cassidy: The core idea is beautiful and it's basically evolution in fast-forward. When you want to sample something that unfolds step by step and you can't compute it directly, you run a whole population of candidates forward in time. Each candidate carries a weight, a score for how plausible it currently looks. And periodically you resample: you kill off the low-weight candidates and you clone the high-weight ones. So the population keeps concentrating on the promising regions. Do that enough, and the survivors provably converge to samples from the distribution you wanted. In this paper, each candidate is a full trajectory, a whole sequence of states and actions. So picture twenty agents marching forward through the same task in lockstep. At set resampling steps, a judge scores each one. The worst get culled. The best get duplicated to refill the population. It's survival-of-the-fittest beam search, guided by a critic.
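
A minimal sketch of that loop, assuming a toy setup: `step_fn` stands in for one call to the black-box agent plus the environment, and `score_fn` stands in for the critic. Both names, and the choice to use the critic score directly as the resampling weight, are simplifications made here rather than the paper's exact weighting.

```python
import random


def resample(particles, scores, rng=random):
    """Multinomial resampling: draw a same-sized population, picking each
    particle with probability proportional to its non-negative score."""
    if len(particles) != len(scores):
        raise ValueError("need exactly one score per particle")
    if not particles or sum(scores) <= 0:
        return list(particles)  # nothing to select on, keep everyone
    return rng.choices(particles, weights=scores, k=len(particles))


def run_population(step_fn, score_fn, init_state,
                   n_particles=20, horizon=10, resample_at=(6,), seed=0):
    """Advance all trajectories in lockstep, culling and cloning at the
    steps listed in `resample_at` (a hypothetical hand-chosen schedule)."""
    rng = random.Random(seed)
    particles = [[init_state] for _ in range(n_particles)]
    for t in range(1, horizon + 1):
        # Each trajectory takes one more step; a new list is built each
        # time, so clones produced by resampling never share later steps.
        particles = [traj + [step_fn(traj, rng)] for traj in particles]
        if t in resample_at:
            scores = [score_fn(traj) for traj in particles]
            particles = resample(particles, scores, rng)
    return particles


# Toy usage: a random walk where the "critic" prefers larger positions.
final = run_population(
    step_fn=lambda traj, rng: traj[-1] + rng.choice([-1, 1]),
    score_fn=lambda traj: max(traj[-1], 0) + 1e-6,
    init_state=0,
)
```

The population size stays fixed throughout. Only its composition changes at each resampling step.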

8:47Eric: And that's the load-bearing picture for the rest of the paper. A beam of agents, advancing together, periodically thinned and re-bred.

8:55Cassidy: That's it. And notice what this buys you operationally: digital environments can be simulated in parallel, and agent tasks are inherently sequential (state, action, new state). That's exactly the structure particle filters were invented for. The fit is almost suspiciously good.

9:13Eric: So now everything reduces to the judge. Who scores the trajectories, and how? That judge is where all the actual intelligence has to live if the big model is frozen.

9:24Cassidy: This is the part I most want listeners to keep straight, because it's the single easiest thing to conflate. There are two separate models in this system. There's the black-box agent, the giant proprietary model doing the acting. And there's a small, separate model, the critic, doing the judging. The big one never changes. All the learning lives in the small one. The cleanest framing I've got is coach and athlete. The agent is a gifted athlete whose training is fixed: you can't change how they were raised. The critic is a cheap, attentive coach on the sideline who, at key moments, says "that line is working, keep going," or "you're heading nowhere, bail." The athlete's raw ability never improves. But the coach steers which attempts survive.

10:13Eric: And the thing the coach is estimating, in the formal language, is a value function. How good are your prospects from where you're standing right now? Expected total future reward.

10:25Cassidy: Right. The paper uses what's called a soft value function, but for our purposes don't overthink the "soft." It's a smoothed estimate of the best outcome you could plausibly reach from this state. A gut-check score of how good your situation looks. The coach glances at a half-finished trajectory and says: point-six, this looks promising. Or: point-one, this is going nowhere.
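
For reference, the soft value that usually accompanies a KL-regularized objective is written below. Whether the paper uses exactly this parameterization is an assumption. The symbols are illustrative: $\pi_0$ is the frozen agent, $R(\tau)$ the trajectory reward, $\beta > 0$ a temperature, and $s_t$ the state reached at step $t$.

```latex
V_{\text{soft}}(s_t) \;=\; \beta \,\log\, \mathbb{E}_{\tau \sim \pi_0}\!\left[ \exp\!\big( R(\tau)/\beta \big) \,\middle|\, s_t \right],
\qquad
\lim_{\beta \to \infty} V_{\text{soft}}(s_t) = \mathbb{E}_{\tau \sim \pi_0}\!\left[ R(\tau) \mid s_t \right],
\qquad
\lim_{\beta \to 0^{+}} V_{\text{soft}}(s_t) = \max_{\tau \,:\, \pi_0(\tau \mid s_t) > 0} R(\tau).
```

The two limits show why it reads as a smoothed best case: a large $\beta$ gives the plain average outcome, and a small $\beta$ approaches the best outcome the frozen agent can reach from $s_t$.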

10:48Eric: So here's the question that decides whether this is clever or circular. To train that coach, to learn the value function, don't you need exactly the kind of expensive online RL you were trying to avoid?

11:03Cassidy: That's the trick that makes the whole thing practical, and the answer is no, and it's because the agent is frozen. Because the big model never changes, learning the value function is just an offline regression problem. You sample a batch of trajectories from the base model once. Then, for each state along those trajectories, you train a small model to predict the cumulative future reward that trajectory ended up getting. No online RL. No policy gradients. No giant cluster. You're just fitting a small model to data you already collected.
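
The data-preparation half of that regression can be sketched as follows. It assumes each logged trajectory is a list of `(state, reward)` pairs, where the reward is what the environment returned after acting from that state. That layout and the function names are hypothetical, and fitting the small critic to the resulting pairs is left out.

```python
def reward_to_go(rewards):
    """Cumulative future reward from each step to the end of the episode."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running += r
        targets.append(running)
    return targets[::-1]


def build_regression_pairs(trajectories):
    """Flatten logged trajectories into (state, target) training pairs.

    `trajectories` is a list of trajectories, each a list of
    (state, reward) tuples collected once from the frozen agent.
    """
    pairs = []
    for traj in trajectories:
        states = [state for state, _ in traj]
        targets = reward_to_go([reward for _, reward in traj])
        pairs.extend(zip(states, targets))
    return pairs


# A sparse-reward episode: nothing until the final step succeeds.
pairs = build_regression_pairs([[("search", 0.0), ("click", 0.0), ("buy", 1.0)]])
```

Because the agent never changes, these pairs only need to be collected once. Any supervised regressor can then be fit to them offline.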

11:42Eric: And the numbers on that are genuinely striking, so let me take the cost story, because this is where the paper earns its keep. The critic they used was small: an 8B model, in one case an 11-billion-parameter model. Training it took about two to three hours. Offline regression, not online RL. All the experiments, the full method, ran on a workstation with two RTX 6000 desktop GPUs. The GRPO baseline they're comparing against needed a node with eight A100s.

12:15Cassidy: And in dollar terms?

12:17Eric: In dollar terms, on cloud rates from their appendix: a GRPO training run on a 7-billion-parameter model came to around seven hundred dollars. The comparable value-function training for their method came to about a hundred and forty. Roughly five times cheaper to set up.

12:36Cassidy: And then the result everyone's going to remember.

12:39Eric: On the science-experiment environment, when they scaled their method up to twenty-five parallel trajectories, it outperformed fully fine-tuning the same model with GRPO. GRPO had full access (it got to open the hood) and the no-gradient method crossed over the top of it. With a stronger base model, it beat GRPO with only five trajectories. The picture from their scaling figure is exactly that: as you add trajectories, the curve climbs and eventually crosses the GRPO line, despite GRPO having every advantage and this method having none of them.

13:21Cassidy: That's the welded-hood car beating the tuned one. Same engine, just smarter driving lines.

13:27Eric: There's a second result I find almost more interesting commercially. They ran a cost-performance swap. A cheaper, weaker "mini" model running their method matched a much more expensive model running plain Best-of-N, the obvious "run it fifteen times, keep the best" baseline. In that comparison, the cheap model with their method scored about point-six-seven for six cents a task. The expensive model with Best-of-15 scored point-five-three for eighteen cents. Cheaper and better.
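
The arithmetic behind those two comparisons is easy to check. The figures below are the approximate ones quoted in the episode. "Score per dollar" is an illustrative ratio introduced here, not a metric from the paper.

```python
def score_per_dollar(score, cost_usd):
    """Task score bought per dollar of per-task cost (illustrative only)."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return score / cost_usd


# Setup cost: ~$700 for the GRPO run vs ~$140 for value-function training.
setup_ratio = 700 / 140

# Per-task comparison: cheap model + the method vs expensive model + Best-of-15.
cheap = score_per_dollar(0.67, 0.06)
expensive = score_per_dollar(0.53, 0.18)
advantage = cheap / expensive
```

Here `setup_ratio` comes out to 5, matching the "roughly five times cheaper" remark, and `advantage` comes out a little under 4.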

13:59Cassidy: And the conceptual punchline buried in there: the small model steers the giant one.

14:04Eric: That's the tail wagging the dog. An 11-billion-parameter critic improved GPT-5.1's performance on the shopping task, nudged it up, without ever touching GPT-5.1. A cheap coach making a frontier athlete measurably better. The critic isn't even in the same class, and it's the thing doing the steering.

14:25Cassidy: Let me make the mechanism concrete, because there's one case in the paper that shows the coach doing its job at a single decisive moment. WebShop, the e-commerce environment. Two trajectories start with the identical search query, looking for, roughly, a gift set with truffles under sixty dollars. They run forward. By step six, they've diverged. One has actually found a product that matches the criteria; the critic looks at it and scores it point-six. The other has bounced back to the search page, spinning its wheels; the critic scores it point-one.

15:01Eric: And at the resampling step?

15:03Cassidy: The point-one trajectory is exactly the kind that gets pruned. The point-six trajectory survives, gets cloned, and ultimately wins full reward. One glance from the coach, at one decisive step, and the population tilts toward the line that works. That's the entire mechanism in a single frame.

15:22Eric: Okay. I've been the enthusiastic one for a while, which is unusual for me, so let me put my actual hat back on — because there are several places this is softer than the headline suggests, and the authors, to their credit, surface most of them.

15:39Cassidy: Go ahead — and I think the honest version of this paper is stronger for it, so let's not soft-pedal.

15:46Eric: Start with the comparison, because "beats GRPO with no gradients" is the line that'll travel. In the main head-to-head, the GRPO scores are taken from published work, not regenerated under identical conditions. Now, the authors do run their own GRPO baselines elsewhere, and there's the catch. On a smaller model, a 3-billion-parameter base, their method with five trajectories actually loses to GRPO. Point-one-three versus point-one-eight. It only pulls ahead once you crank up to twenty trajectories.

16:22Cassidy: So "beats GRPO" is real, but it lives at the high-trajectory-count end.

16:26Eric: It's a true claim that depends on which comparison you're looking at and how much parallel compute you're willing to spend. Which leads to the second thing. The whole value-add of this method is discriminating among diverse trajectories. The places it wins big are environments with rich, varied intermediate states, with lots for the coach to distinguish between. And there's a flip side they're honest about.

16:54Cassidy: The crafting-task case. This is the boundary condition, and I think it's the most illuminating result in the paper, not a footnote.

17:03Eric: Lay it out, because it tells you exactly when not to use this.

17:07Cassidy: On a Minecraft-style crafting task, with GPT-5.1 as the agent, the method actually underperformed plain Best-of-N. Point-seven-nine versus point-eight-nine. It made things worse. And the explanation is the whole story: GPT-5.1 on that task produces short, high-confidence, nearly-uniform trajectories. They're all kind of the same, and they're all kind of good. So there's almost no diversity for the critic to discriminate between. And when everything looks alike, resampling occasionally prunes a good trajectory by accident. Pure noise. The coach has nothing to coach.

17:46Eric: Which gives you the actual usage rule. This helps when the model produces good-but-not-uniformly-perfect trajectories. When the model is already excellent and consistent, there's nothing to select among, and the selection process can only add noise.

18:03Cassidy: And it gently undercuts the framing for the very best models, right? If the agent is already near-perfect on your task...

18:11Eric: There's nothing for the method to improve, and it might hurt. The authors' rejoinder is fair, though: real-world task complexity has no ceiling, and cost pressure pushes everyone toward smaller models — and that's exactly the regime where there's headroom and diversity to exploit.

18:30Cassidy: There are two more I want to make sure we name.

18:33Eric: The "black-box" framing is doing some quiet work. The desktop-GPU story is true for the critic: the value function runs locally and cheaply. But the agent calls still hit the API. Running twenty-five trajectories with periodic resampling means a lot of parallel API calls. In a genuinely rate-limited or expensive-per-call setting, "just run twenty-five copies" is less free than the two-desktop-GPUs headline implies. The local part is cheap; the calls to the black box still cost.

19:06Cassidy: And the resampling schedule is hand-tuned.

19:09Eric: That's the one that complicates the "principled, plug-and-play" story most. The strong results use task-specific resampling schedules (resample at step six here, at steps four and twelve there) chosen empirically per environment. They tried the principled, automatic alternative, the one that decides on the fly when to resample, and it underperformed the hand-tuned schedule. So some of the performance comes from a knob you have to find per task.
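
The episode doesn't say which automatic rule was tried. A common choice in the Sequential Monte Carlo literature is to resample only when the effective sample size (ESS) of the weights drops below a fraction of the population. The sketch below shows that general technique as an assumption, not as the paper's procedure. The `threshold` value is hypothetical.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) over normalized weights.

    Equals the population size when all weights are equal and approaches 1
    when a single particle carries almost all of the weight.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


def should_resample(weights, threshold=0.5):
    """Trigger resampling when ESS falls below threshold * population size."""
    return effective_sample_size(weights) < threshold * len(weights)


uniform = should_resample([1.0, 1.0, 1.0, 1.0])    # scores all alike
lopsided = should_resample([1.0, 0.0, 0.0, 0.0])   # one clear winner
```

Under this rule, a population whose scores all look alike never triggers a resample, while a lopsided one does.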

19:37Cassidy: And underneath all of it, the foundation is a learned approximation. The coach is never exactly right. The authors say plainly it'll never perfectly approximate the true value function, and the crafting-task failure is a direct consequence: bad value estimates pruning good trajectories. The whole edifice rests on a critic that's good enough, not a critic that's correct.

20:01Eric: To be fair to them, Cassidy, they don't hide any of this. They explicitly position the method as a viable alternative when gradient-based fine-tuning is impossible, not a replacement for it. They call GRPO an oracle rather than a baseline. The framing is: this is what you reach for when you can't open the hood, and you're stuck with the welded car.

20:22Cassidy: Which is the right frame, and it's why I think the result holds up even after all those caveats. The contribution isn't "we beat GRPO." It's that a mathematical equivalence that lived in theory papers for years (RL is secretly Bayesian inference) turns out to buy you something operationally decisive in a world the original theorists weren't imagining. Sealed, API-only models.

20:47Eric: That's the part I keep turning over. The equivalence was always true. Nobody could do anything with it, because in the regime it was proven for, you had the weights anyway, so why bother sampling when you could just train? It took the arrival of models you're locked out of for the theory to suddenly have a job.

21:07Cassidy: The constraint created the use case. When you can't open the hood, "the optimal policy is a distribution you can sample from" stops being an elegant footnote and becomes the only door left.

21:19Eric: And the door opens onto cheap hardware. That's the democratization angle, and it's concrete, not hand-wavy. Reward-driven optimization of frontier-scale agents has been the province of organizations with big clusters. This shifts a real chunk of that onto two desktop GPUs and a few hundred dollars, by trading retraining for clever sampling.

21:42Cassidy: If you want the one thing to walk away with: they didn't make the model better. They ran a population of its attempts, scored each one with a cheap coach, and kept breeding the winners. The model never changed. What came out of it did.

21:57Eric: The chef, the tasting panel, and a hood that stays welded shut.

22:01Cassidy: That's the paper, "Agentic Monte Carlo." The show notes have a link to it and a few related reads if this is your kind of thing.

22:10Eric: And if you want to keep pulling on the thread, paperdive.ai has the full transcript with every term defined inline, plus the concept pages that link this episode over to the other things we've covered on inference-time compute and probabilistic methods.

22:27Cassidy: Thanks for spending it with us. This has been AI Papers: A Deep Dive.