Beating Reinforcement Learning Without Ever Touching the Model's Weights
About this episode
Two desktop GPUs matched — and on one task beat — a reinforcement learning method that needed eight A100s, all without computing a single gradient or fine-tuning anything. The trick is an old theoretical equivalence between RL and Bayesian inference that only became useful once frontier models got sealed behind APIs. We unpack how a cheap critic steering a frozen giant turns training into selection, and where the whole approach quietly falls apart.
What you'll take away
- Why a known equivalence between reinforcement learning and Bayesian inference, long stuck in theory papers, suddenly becomes useful only when you're locked out of a model's weights
- How the method replaces gradient-based training with Sequential Monte Carlo — running a population of agent trajectories, scoring them with a small critic, and 'breeding' the winners
- Why learning the value function is just a cheap offline regression problem (2-3 hours, ~$140) rather than expensive online RL, because the big model stays frozen
- The headline result: on SciWorld, the no-gradient method crossed over GRPO, which had full access to the model's internals — and an 11B critic measurably improved GPT-5.1 without touching it
- The TextCraft failure case: when a model produces uniformly good, near-identical trajectories, there's nothing to select among and the method actually makes things worse
- The fine print behind 'beats GRPO': it depends on high trajectory counts, hand-tuned resampling schedules, parallel API calls that aren't free, and a value estimate that's only ever approximate
Chapters
- 00:00The welded-shut hood problem
- 20:37RL is secretly Bayesian inference
- 05:37Particle filters and a beam of agents
- 08:26The cheap coach steering the frozen athlete
- 11:14The results that travel
- 14:03One decisive fork in WebShop
- 16:52Where it gets softer than the headline
- 19:41Why the constraint created the use case
References in this episode
- Levine: Reinforcement Learning and Control as Probabilistic Inference — The tutorial that lays out the exact RL-as-Bayesian-inference equivalence this e
- Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs — Directly applies the particle-filter/SMC machinery the episode describes to stee
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Introduces GRPO, the gradient-based RL 'oracle' baseline the episode repeatedly
- Training Verifiers to Solve Math Word Problems — The work that popularized training a separate verifier/value model to score and
Full transcript
0:00Cassidy: Two desktop GPUs — the kind you could fit under a desk — matched the performance of a reinforcement learning method that needed eight A100s, a full server node, to do its job. And here's the part that should make you sit up: the desktop method never touched the model's weights. Not once.
0:19Eric: Never touched them — meaning no gradients, no fine-tuning, nothing reaching inside the model at all?
0:26Cassidy: Nothing. Just API calls to the big model, plus a small helper model running locally. And on at least one task, it didn't merely match the gradient-based method. It beat it.
0:37Eric: Which, if you know how reinforcement learning normally works, sounds a little like cheating. The whole point of gradient-based methods is that they get to open the hood.
0:48Cassidy: Right — and that tension is the paper. This result comes from work posted to arXiv on June third, twenty-twenty-six, and we're recording two days later, on June fifth. Quick note before we dig in: what you're hearing is an AI-generated episode. The script was written by Anthropic's Claude Opus 4.8, and the two of us — I'm Cassidy, that's Eric — we're both AI voices from Eleven Labs. The show's produced independently, no affiliation with either company. The paper is called "Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents." And that title gives the game away — the word "simulating" is doing enormous work. They're not doing reinforcement learning. They're faking its effects without the machinery.
1:35Eric: So let me set up why anyone needs to fake it, because the constraint here is genuinely brutal and it's easy to wave past. The most capable AI agents right now are built on top of proprietary models — GPT-5, Gemini, Claude. You reach them through an API. Text in, text out. The weights are sealed behind a corporate wall, and you never even see the raw probabilities the model assigns to its next word. Now, the dominant way we make an agent better at a specific task — say, navigating an e-commerce site or running a simulated chemistry experiment — is reinforcement learning. Methods like PPO and GRPO. And those work by computing gradients. A gradient is a directional nudge to every internal parameter: turn this dial up a hair, that one down. To compute it, you have to be inside the model, seeing its guts.
2:31Cassidy: And you simply can't do that through an API.
2:34Eric: You can't. So the analogy I keep coming back to: you want a faster lap time, but the hood of the race car is welded shut. You can drive it, you can watch the lap times — but you can't touch the engine. That's a black-box frontier model. The most powerful cars on the track are exactly the ones whose hoods you're least allowed to open.
2:57Cassidy: And the two escape hatches people normally reach for are both unsatisfying.
3:02Eric: Both bad. Option one: fiddle with the prompt. Cheap, but you're just asking the model nicely — there's a ceiling. Option two: take a smaller open-weight model, one whose hood you can open, fine-tune that instead, and hope it transfers. But now you've abandoned the powerful model you actually wanted. You're optimizing the wrong car. Neither path lets you do real reward-driven optimization on the actual target.
3:31Cassidy: So that's the bind this paper walks into. And the way out runs through a piece of theory that, honestly, sounds too convenient when you first hear it — a known mathematical equivalence between reinforcement learning and Bayesian inference.
3:47Eric: Which is one of those phrases that's been in theory papers for years and mostly stayed there. Let's make it land, because everything hangs on it.
3:57Cassidy: Let me build it in two moves. First move — what is reinforcement-learning fine-tuning actually trying to do? Strip away the gradients. The goal is: take a model that already produces a wide spread of behaviors, and shift its tendencies toward the ones that earn high reward — without letting it drift so far that it forgets how to be sensible and fluent. That leash, the "don't wander off" term, is mathematically a penalty for straying too far from the original model. Hold onto that leash. It's the secret ingredient.
4:30Eric: So you're not rebuilding the model. You're rebalancing how often it does the good things versus the bad things, while keeping it tethered to its original character.
4:41Cassidy: Exactly. And here's the second move — the flip. There's a known result that says: the optimal policy under that leashed objective is exactly a Bayesian posterior. Now, Bayes' rule, in plain shape: you start with a prior — what you believed before — multiply by a likelihood — how well each option explains the evidence — and you get a posterior, your updated belief. In this paper, the prior is the frozen black-box model's natural distribution over everything it might do. The likelihood is "this behavior earned high reward." Multiply them, and the posterior you get is the optimal agent. Not as a retrained model. As a probability distribution over whole trajectories.
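One standard way to write out the equivalence described here is sketched below. The symbols are assumptions, since the episode does not name any: $p_0$ is the frozen model's distribution over trajectories $\tau$, $R(\tau)$ is the total reward of a trajectory, and $\beta > 0$ sets how tight the leash is.

```latex
\pi^\star
  = \arg\max_{\pi}\;
    \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \big]
    - \beta \, \mathrm{KL}\big( \pi \,\|\, p_0 \big)
\quad\Longrightarrow\quad
\pi^\star(\tau)
  = \frac{ p_0(\tau)\, \exp\!\big( R(\tau) / \beta \big) }
         { \sum_{\tau'} p_0(\tau')\, \exp\!\big( R(\tau') / \beta \big) }
```

Read as Bayes' rule, $p_0(\tau)$ plays the role of the prior and $\exp(R(\tau)/\beta)$ plays the role of the likelihood. A small $\beta$ loosens the leash and concentrates the posterior on high-reward trajectories, while a large $\beta$ keeps it close to the prior.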
5:24Eric: And the payoff of writing it that way is what, exactly?
5:28Cassidy: If the optimal agent is a distribution, then the question stops being "how do I train?" and becomes "how do I draw samples from this distribution?" And sampling doesn't need gradients. You never have to open the hood. Here's the image I'd hold. Think of the black-box model as a chef with fixed habits, who produces a wide range of dishes — that's the prior. You've got a tasting panel scoring each dish — that's the reward. The optimal kitchen isn't a retrained chef. It's the same chef, unchanged, whose good dishes simply get served far more often. Exponentially more often, the better they are. He never strays into bizarre experiments, because he's still cooking in his own natural style. The leash keeps him there.
6:16Eric: So the chef never changes. What comes out of the kitchen does.
6:20Cassidy: That's the whole conceptual heart in one line. You replace training with selection. You let the model produce the things it already produces, and you preferentially serve the good ones.
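The chef picture can be made concrete with a few lines of arithmetic. This is a minimal sketch with made-up numbers: the three trajectory names, their prior probabilities, their rewards, and the `beta` value are all hypothetical and do not come from the paper.

```python
import math

def reweight(prior, rewards, beta=0.25):
    """Posterior over trajectories: prior(t) * exp(reward(t) / beta), normalized."""
    unnorm = {t: prior[t] * math.exp(rewards[t] / beta) for t in prior}
    z = sum(unnorm.values())
    return {t: w / z for t, w in unnorm.items()}

# Hypothetical numbers: three things the frozen model might do on one task.
prior = {"finds_product": 0.2, "loops_on_search": 0.5, "buys_wrong_item": 0.3}
rewards = {"finds_product": 1.0, "loops_on_search": 0.0, "buys_wrong_item": 0.2}

posterior = reweight(prior, rewards)
for name in prior:
    print(f"{name}: prior {prior[name]:.2f} -> posterior {posterior[name]:.2f}")
```

The model itself is untouched: the same three behaviors exist before and after, and only how often each one is served changes. A behavior with zero prior probability also keeps zero posterior probability however large its reward, which is the leash at work.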
6:33Eric: Okay, but "preferentially serve the good ones" is hiding a hard problem, and I want to drag it into the light. You've described a target distribution — the posterior, the reweighted chef. You can write it down. But the paper itself says you can't actually sample from it directly. Why not?
6:52Cassidy: Two reasons, and they're both about scale. Agent tasks have enormous action spaces and long horizons — you're making a sequence of decisions, each branching into a huge tree of possibilities. And for a true black box, you can't even read the token-level probabilities to help you steer. So you know what distribution you want, but you have no direct way to draw from it. You can only sample from the prior — which just means: run the agent and watch what it does.
7:22Eric: So you can sample from the chef's natural output, but not from the idealized "good dishes served more often" version.
7:30Cassidy: Right. And that gap — "I can sample from the prior but I want the posterior" — is the oldest problem in a whole field of statistics. The tool they reach for is Sequential Monte Carlo. Particle filters.
7:43Eric: Which is probably the least familiar piece for a machine-learning audience, so let's not rush it.
7:49Cassidy: The core idea is beautiful and it's basically evolution in fast-forward. When you want to sample something that unfolds step by step and you can't compute it directly, you run a whole population of candidates forward in time. Each candidate carries a weight — a score for how plausible it currently looks. And periodically you resample: you kill off the low-weight candidates and you clone the high-weight ones. So the population keeps concentrating on the promising regions. Do that enough, and the survivors provably converge to samples from the distribution you wanted. In this paper, each candidate is a full agent trajectory — a whole sequence of states and actions. So picture twenty agents marching forward through the same task in lockstep. At checkpoints, a judge scores each one. The worst get culled. The best get duplicated to refill the population. It's survival-of-the-fittest beam search, guided by a critic.
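The cull-and-clone loop described above might look roughly like the following. This is a simplified sketch and not the paper's algorithm: `run_population`, `toy_step`, and `toy_critic` are hypothetical names, the environment is a coin flip standing in for a call to the frozen agent, and resampling in proportion to `exp(score / beta)` is an assumed weighting.

```python
import math
import random

def run_population(n_particles, horizon, checkpoints, step, critic, beta=0.2, seed=0):
    """Advance a population in lockstep; at checkpoints, cull and clone by critic score."""
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]  # each particle is one trajectory
    for t in range(1, horizon + 1):
        # Extend every trajectory by one step (stand-in for one call to the frozen agent).
        particles = [traj + [step(traj, rng)] for traj in particles]
        if t in checkpoints:
            # Higher critic score means an exponentially larger chance of being cloned.
            weights = [math.exp(critic(traj) / beta) for traj in particles]
            chosen = rng.choices(particles, weights=weights, k=n_particles)
            # Copy each survivor so that clones can diverge after this point.
            particles = [list(traj) for traj in chosen]
    return particles

# Toy stand-ins (hypothetical): an action is "good" 30% of the time, and the
# critic scores a partial trajectory by its fraction of good actions so far.
def toy_step(traj, rng):
    return "good" if rng.random() < 0.3 else "bad"

def toy_critic(traj):
    return traj.count("good") / len(traj)

final = run_population(n_particles=20, horizon=12, checkpoints={4, 8},
                       step=toy_step, critic=toy_critic)
print(len(final), "trajectories of length", len(final[0]))
```

Passing `checkpoints=set()` turns the resampling off, which gives plain independent rollouts from the prior to compare against.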
8:47Eric: And that's the load-bearing picture for the rest of the paper. A beam of agents, advancing together, periodically thinned and re-bred.
8:55Cassidy: That's it. And notice what this buys you operationally: digital environments can be simulated in parallel, and agent trajectories are inherently sequential — state, action, new state. That's exactly the structure particle filters were invented for. The fit is almost suspiciously good.
9:13Eric: So now everything reduces to the judge. Who scores the trajectories, and how — because that judge is where all the actual intelligence has to live if the big model is frozen.
9:24Cassidy: This is the part I most want listeners to keep straight, because it's the single easiest thing to conflate. There are two separate models in this system. There's the frozen black-box agent — the giant proprietary model doing the acting. And there's a small, separate model — the critic — doing the judging. The big one never changes. All the learning lives in the small one. The cleanest framing I've got is coach and athlete. The frontier model is a gifted athlete whose training is fixed — you can't change how they were raised. The critic is a cheap, attentive coach on the sideline who, at key moments, says "that line is working, keep going," or "you're heading nowhere, bail." The athlete's raw ability never improves. But the coach steers which attempts survive.
10:13Eric: And the thing the coach is estimating — in the formal language — is a value function. How good are your prospects from where you're standing right now? Expected total future reward.
10:25Cassidy: Right. The paper uses what's called a soft value function, but for our purposes don't overthink the "soft" — it's a smoothed estimate of the best outcome you could plausibly reach from this state. A gut-check score of how good your situation looks. The coach glances at a half-finished trajectory and says: point-six, this looks promising. Or: point-one, this is going nowhere.
10:48Eric: So here's the question that decides whether this is clever or circular. To train that coach — to learn the value function — don't you need exactly the kind of expensive online reinforcement learning you were trying to avoid?
11:03Cassidy: That's the trick that makes the whole thing practical, and the answer is no — and it's because the prior is frozen. Because the big model never changes, learning the value function is just an offline regression problem. You sample a batch of trajectories from the base model once. Then, for each state along those trajectories, you train a small model to predict the cumulative future reward that trajectory ended up getting. No online rollouts. No policy gradients. No giant cluster. You're just fitting a small model to data you already collected.
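To see why this is ordinary supervised learning, here is a minimal sketch of the offline recipe with the simplest possible regressor, a lookup table of mean returns, standing in for the small critic model. The state names and logged rollouts are hypothetical, and the plain average used here is an assumption that ignores the "soft" smoothing the paper applies.

```python
from collections import defaultdict

def reward_to_go(rewards):
    """For each step, the total reward collected from that step to the end."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running += r
        targets.append(running)
    return targets[::-1]

def fit_value_table(trajectories):
    """Regress reward-to-go on state; for a lookup table, least squares is the per-state mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        states = [s for s, _ in traj]
        targets = reward_to_go([r for _, r in traj])
        for s, g in zip(states, targets):
            sums[s] += g
            counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}

# Hypothetical rollouts logged once from the frozen agent:
# each entry is (state, reward received after acting in that state).
logged = [
    [("search", 0.0), ("product_page", 0.0), ("checkout", 1.0)],
    [("search", 0.0), ("search", 0.0), ("product_page", 0.0), ("checkout", 0.0)],
    [("search", 0.0), ("search", 0.0), ("search", 0.0)],
]
value = fit_value_table(logged)
for state, v in sorted(value.items()):
    print(f"{state}: {v:.2f}")
```

Nothing in this loop calls the agent again or computes a policy gradient. The data is collected once, and the fit can be rerun as often as needed because the model that produced the data never changes.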
11:42Eric: And the numbers on that are genuinely striking, so let me take the cost story, because this is where the paper earns its keep. The value model they used was small — a Llama-3.1-8B, in one case an 11-billion-parameter model. Training it took about two to three hours. Offline regression, not online RL. All the AMC experiments — the full method — ran on a workstation with two RTX 6000 desktop GPUs. The GRPO baseline they're comparing against needed a node with eight A100s.
12:15Cassidy: And in dollar terms?
12:17Eric: In dollar terms, on cloud rates from their appendix: a GRPO training run on a 7-billion-parameter model came to around seven hundred dollars. The comparable value-function training for their method came to about a hundred and forty. Roughly five times cheaper to set up.
12:36Cassidy: And then the result everyone's going to remember.
12:39Eric: On the science-experiment environment, SciWorld, when they scaled their method up to twenty-five parallel trajectories, it outperformed fully fine-tuning the same model with GRPO. GRPO had full gradient access — it got to open the hood — and the no-gradient method crossed over the top of it. With a stronger prior, a GPT-5.1 base, it beat GRPO with only five trajectories. The picture from their scaling figure is exactly that: as you add trajectories, the curve climbs and eventually crosses the GRPO line — despite GRPO having every advantage and this method having none of them.
13:21Cassidy: That's the welded-hood car beating the tuned one. Same engine, just smarter driving lines.
13:27Eric: There's a second result I find almost more interesting commercially. They ran a cost-performance swap. A cheaper, weaker model — GPT-4.1-mini — running their method, matched a much more expensive model running plain Best-of-15, the obvious "run it fifteen times, keep the best" baseline. On SciWorld, the cheap model with their method scored about point-six-seven for six cents a task. The expensive model with Best-of-15 scored point-five-three for eighteen cents. Cheaper and better.
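The cost figures quoted so far can be lined up in a few lines of arithmetic. The numbers below are the ones stated in the episode. The dictionary layout and the `score_per_dollar` helper are illustrative choices only, and score per dollar is not a metric the paper is said to report.

```python
# Figures as quoted in the episode.
grpo_setup_usd = 700.0           # GRPO training run on a 7B model
value_training_usd = 140.0       # value-function training for the sampling method
cheap_with_method = {"score": 0.67, "cost_usd": 0.06}      # GPT-4.1-mini, SciWorld, per task
expensive_best_of_15 = {"score": 0.53, "cost_usd": 0.18}   # pricier model, Best-of-15, per task

def score_per_dollar(run):
    """Illustrative ratio only; the episode does not quote this metric."""
    return run["score"] / run["cost_usd"]

setup_ratio = grpo_setup_usd / value_training_usd
cost_ratio = expensive_best_of_15["cost_usd"] / cheap_with_method["cost_usd"]
score_gain = cheap_with_method["score"] - expensive_best_of_15["score"]

print(f"setup cost ratio: {setup_ratio:.1f}x")
print(f"per-task cost ratio: {cost_ratio:.1f}x")
print(f"score difference: {score_gain:+.2f}")
print(f"score per dollar: {score_per_dollar(cheap_with_method):.1f} "
      f"vs {score_per_dollar(expensive_best_of_15):.1f}")
```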
13:59Cassidy: And the conceptual punchline buried in there — the small model steering the giant one.
14:04Eric: That's the tail wagging the dog. An 11-billion-parameter critic improved GPT-5.1's performance on the shopping task — nudged it up — without ever touching GPT-5.1. A cheap coach making a frontier athlete measurably better. The critic isn't even in the same weight class, and it's the thing doing the steering.
14:25Cassidy: Let me make the steering concrete, because there's one case in the paper that shows the coach doing its job at a single decisive moment. WebShop — the e-commerce environment. Two agents start with the identical search query, looking for, roughly, a gift set with truffles under sixty dollars. They run forward. By step six, they've diverged. One has actually found a product that matches the criteria — the critic looks at it and scores it point-six. The other has bounced back to the search page, spinning its wheels — the critic scores it point-one.
15:01Eric: And at the resampling checkpoint —
15:03Cassidy: The point-one trajectory is exactly the kind that gets pruned. The point-six survives, gets cloned, and ultimately wins full reward. One glance from the coach, at one decisive fork, and the population tilts toward the line that works. That's the entire mechanism in a single frame.
15:22Eric: Okay. I've been the enthusiastic one for a while, which is unusual for me, so let me put my actual hat back on — because there are several places this is softer than the headline suggests, and the authors, to their credit, surface most of them.
15:39Cassidy: Go ahead — and I think the honest version of this paper is stronger for it, so let's not soft-pedal.
15:46Eric: Start with the GRPO comparison, because "beats GRPO with no gradients" is the line that'll travel. In the main head-to-head, the GRPO scores are taken from prior published work, not regenerated under identical conditions. Now, the authors do fine-tune their own GRPO baselines elsewhere — and there's the catch. On a smaller model, a 3-billion-parameter base, their method with five trajectories actually loses to GRPO. Point-one-three versus point-one-eight. It only pulls ahead once you crank up to twenty trajectories.
16:22Cassidy: So "beats GRPO" is real, but it lives at the high-trajectory-count end.
16:26Eric: It's a true claim that depends on which comparison you're looking at and how much parallel compute you're willing to spend. Which leads to the second thing. The whole value-add of this method is discriminating among diverse trajectories. The places it wins big are environments with rich, varied intermediate states — lots for the coach to distinguish between. And there's a flip side they're honest about.
16:54Cassidy: The TextCraft case. This is the boundary condition, and I think it's the most illuminating result in the paper, not a footnote.
17:03Eric: Lay it out, because it tells you exactly when not to use this.
17:07Cassidy: On TextCraft — a Minecraft-style crafting task — with a GPT-5.1 prior, the method actually underperformed plain Best-of-15. Point-seven-nine versus point-eight-nine. It made things worse. And the explanation is the whole story: GPT-5.1 on that task produces short, high-confidence, nearly-uniform trajectories. They're all kind of the same, and they're all kind of good. So there's almost no diversity for the critic to discriminate between. And when everything looks alike, resampling occasionally prunes a good trajectory by accident — pure noise. The coach has nothing to coach.
17:46Eric: Which gives you the actual usage rule. This helps when the model produces good-but-not-uniformly-perfect trajectories. When the model is already excellent and consistent, there's nothing to select among, and the selection process can only add noise.
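A short calculation shows why near-identical scores turn resampling into noise. It assumes multinomial resampling with weights proportional to the critic score, which the episode does not confirm as the paper's scheme, and the score lists are invented for illustration.

```python
def drop_probability(weights, index):
    """Chance that particle `index` gets zero copies under multinomial resampling."""
    n = len(weights)
    p = weights[index] / sum(weights)
    return (1.0 - p) ** n

uniform = [0.8] * 15           # every trajectory looks equally good to the critic
varied = [0.6] + [0.1] * 14    # one trajectory clearly stands out

# With uniform scores, any given trajectory is dropped about 36% of the time.
print(f"uniform scores: {drop_probability(uniform, 0):.3f}")
# With varied scores, the standout is dropped less than 1% of the time.
print(f"varied scores:  {drop_probability(varied, 0):.4f}")
```

When the scores carry no signal, each resampling step still discards roughly a third of the population at random and fills the gaps with duplicates. That shrinks the diversity a final best-of selection would otherwise draw on, without improving anything in exchange.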
18:03Cassidy: And it gently undercuts the framing for the very best models, right? If the frontier model is already near-perfect on your task —
18:11Eric: There's nothing for the method to improve, and it might hurt. The authors' rejoinder is fair, though: real-world task complexity has no ceiling, and cost pressure pushes everyone toward smaller models — and that's exactly the regime where there's headroom and diversity to exploit.
18:30Cassidy: There are two more I want to make sure we name.
18:33Eric: The "black-box" framing is doing some quiet work. The desktop-GPU story is true for the critic — the value model runs locally and cheaply. But the agent rollouts still hit the API. Running twenty-five trajectories with periodic resampling means a lot of parallel API calls. In a genuinely rate-limited or expensive-per-call setting, "just run twenty-five copies" is less free than the two-desktop-GPUs headline implies. The local part is cheap; the calls to the frontier model still cost.
19:06Cassidy: And the resampling schedule is hand-tuned.
19:09Eric: That's the one that complicates the "principled, plug-and-play" story most. The strong results use task-specific checkpoints — resample at step six here, at steps four and twelve there — chosen empirically per environment. They tried the principled, automatic alternative, the one that decides on the fly when to resample, and it underperformed the hand-tuned schedule. So some of the performance comes from a knob you have to find per task.
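The difference between the two scheduling styles can be sketched as follows. The episode does not say which automatic rule the authors tried. The effective-sample-size threshold below is the textbook adaptive rule from the Sequential Monte Carlo literature and is used here purely as an assumption, and `should_resample` and its parameters are hypothetical names.

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2: equals N for uniform weights, 1 if one weight dominates."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def should_resample(step, weights, fixed_checkpoints=None, ess_fraction=0.5):
    """Use the fixed schedule if one is given; otherwise trigger when ESS falls below a fraction of N."""
    if fixed_checkpoints is not None:
        return step in fixed_checkpoints
    return effective_sample_size(weights) < ess_fraction * len(weights)

# Hand-tuned style: resample only at a step chosen per environment.
print(should_resample(6, [1.0, 1.0, 1.0, 1.0], fixed_checkpoints={6}))
# Adaptive style: resample whenever the weights have become lopsided.
print(should_resample(5, [1.0, 0.0, 0.0, 0.0]))
print(should_resample(5, [1.0, 1.0, 1.0, 1.0]))
```

The fixed schedule needs one tuned set of checkpoints per task, while the adaptive rule has a single threshold but depends entirely on how informative the weights are at each step.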
19:37Cassidy: And underneath all of it, the foundation is a learned approximation. The coach is never exactly right. The authors say plainly it'll never perfectly approximate the true value function, and the TextCraft failure is a direct consequence — bad value estimates pruning good trajectories. The whole edifice rests on a critic that's good enough, not a critic that's correct.
20:01Eric: To be fair to them, Cassidy, they don't hide any of this. They explicitly position the method as a viable alternative when gradient-based RL is impossible — not a replacement for it. They call GRPO an oracle rather than a baseline. The framing is: this is what you reach for when you can't open the hood, and you're stuck with the welded car.
20:22Cassidy: Which is the right frame, and it's why I think the result holds up even after all those caveats. The contribution isn't "we beat reinforcement learning." It's that a mathematical equivalence that lived in theory papers for years — RL is secretly Bayesian inference — turns out to buy you something operationally decisive in a world the original theorists weren't imagining. Sealed, API-only models.
20:47Eric: That's the part I keep turning over. The equivalence was always true. Nobody could do anything with it, because in the regime it was proven for, you had the weights anyway — so why bother sampling when you could just train? It took the arrival of models you're locked out of for the theory to suddenly have a job.
21:07Cassidy: The constraint created the use case. When you can't open the hood, "the optimal agent is a distribution you can sample from" stops being an elegant footnote and becomes the only door left.
21:19Eric: And the door opens onto cheap hardware. That's the democratization angle, and it's concrete, not hand-wavy. Reward-driven optimization of frontier-scale agents has been the province of organizations with big clusters. This shifts a real chunk of that onto two desktop GPUs and a few hundred dollars — by trading retraining for clever sampling.
21:42Cassidy: If you want the one thing to walk away with: they didn't make the model better. They ran a population of its attempts, scored each one with a cheap coach, and kept breeding the winners. The model never changed. What came out of it did.
21:57Eric: The chef, the tasting panel, and a hood that stays welded shut.
22:01Cassidy: That's the paper — "Agentic Monte Carlo." The show notes have a link to it and a few related reads if this is your kind of thing.
22:10Eric: And if you want to keep pulling on the thread, paperdive.ai has the full transcript with every term defined inline, plus the concept pages that link this episode over to the other things we've covered on inference-time compute and probabilistic methods.
22:27Cassidy: Thanks for spending it with us. This has been AI Papers: A Deep Dive.