Two Levers for Self-Improving AI: When Rewriting Code Isn't Enough
About this episode
An AI agent spent many iterations rewriting its own scaffolding to denoise genomic data and hit a wall. Then it was allowed to retrain its own weights — and on the first try, it added two trivial lines of code that any biologist would have spotted, cutting error by twenty percent. A new paper argues that scaffold edits and weight updates reach fundamentally different places, and that no self-improvement loop touching only one is going to be enough.
What you'll take away
- Why scaffold rewrites and weight updates are not interchangeable — they change different things (how the agent searches vs. what the model knows)
- How SIA's Feedback-Agent reads full agent trajectories to decide which lever to pull, and even picks which RL algorithm to use
- Concrete results across three deliberately different domains: Chinese legal classification, CUDA kernel optimization on H100s, and single-cell RNA-seq denoising
- Why the headline 502% improvement is real but misleading — the mechanism claim is closer to a 20% gain over the harness-only ceiling
- The 'coupled co-evolutionary Goodhart' failure mode the authors themselves flag: two optimizers converging on a verifier rather than the underlying problem
- What the paper does and doesn't prove — a credible proof of concept, not a settled result, with clean verifiers doing more work than the framing admits
Chapters
- 00:00 The two-line fix that broke a plateau
- 03:08 Two camps that haven't been talking
- 06:17 Inside the SIA architecture
- 09:26 Three benchmarks, three shapes of expertise
- 12:34 Picking the RL algorithm on the fly
- 16:23 The skeptic pass
- 18:53 Coupled co-evolutionary Goodhart
- 22:00 What this would mean if it generalizes
References in this episode
- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents — A leading example of the scaffold-evolution camp the episode contrasts with weig
- The Surprising Effectiveness of Test-Time Training for Abstract Reasoning — Akyürek et al.'s test-time-training work, representing the opposite camp SIA tri
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Introduces GRPO, the RL algorithm the Feedback-Agent picks for the LawBench task
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery — Another reference point in the scaffold-iteration lineage SIA positions itself a
Full transcript
0:00Juniper: An AI has been grinding away on a genomics problem for many iterations. It's been rewriting its own code, sweeping hyperparameters, swapping out approaches. The error metric drifts down, plateaus, drifts down a little more, plateaus harder. The loop has hit a wall. Then the researchers flip a different switch — they let the model retrain its own weights. And on the very first try, the model adds two lines of code: clip the negative numbers, round to the nearest integer. That's it. The error drops about twenty percent in one shot.
0:36Tyler: And what makes that moment land is that those two lines encode something any biologist would tell you in five seconds. You can't have a negative count of RNA molecules in a cell. You can't have half a molecule either. It's a biological invariant — trivially true, trivially correct.
0:57Juniper: Right. And the AI's scaffolding loop, the one that had been editing its own code and trying every variation, never found it. Across many iterations. The paper went up on arXiv yesterday, May twenty-sixth, twenty-twenty-six, and we're recording on May twenty-seventh. Quick note before we dig in: this episode is AI-generated, the script's from Anthropic's Claude Opus 4.7, and the two voices you're hearing (I'm Juniper, that's Tyler) are both AI voices from ElevenLabs. The producer isn't affiliated with either company. The paper is "SIA: Self Improving AI with Harness and Weight Updates," from a team at Hexo Labs with a collaborator at Oxford, and the reason that two-line fix is worth opening with is that it's the cleanest demonstration in the paper of a much bigger claim.
1:50Tyler: The bigger claim being — those two interventions, rewriting the code around the model versus updating the model's weights, are not interchangeable. They reach different places. The scaffold loop could have run forever and never produced "clip and round." The weight update found it in one pass.
2:10Juniper: Exactly. So let me set up the puzzle the paper is responding to, because the framing is actually pretty elegant. For the last few years, two research communities have been chasing self-improving AI from opposite ends — and they've barely talked to each other. One camp says: take the model as fixed, and let an AI rewrite the scaffolding around it. The prompts, the parsers, the retry logic, the tool calls — all the software that wraps the model. Think Darwin Gödel Machine, Meta-Harness, the AI Scientist line of work. The model is a frozen engine; the AI is editing the car around it. The other camp says the opposite. Leave the scaffolding alone — but let the model retrain its own weights, on the fly, on whatever task you're trying to solve. That's the test-time-training line. TTRL, the Akyürek work on test-time adaptation. They build an RL pipeline that nudges the weights based on whatever reward signal the task provides.
3:15Tyler: And each camp has an implicit critique of the other baked into its silence. The scaffold people don't touch the model's actual knowledge. You can write the world's best prompt for diagnosing fraud cases — if the model has never internalized what fraud looks like, the prompt can only do so much. The weight-update people don't touch the search procedure. You can retrain the model's intuitions all day, but if the agent around it can't parse outputs cleanly or recover from a failed tool call, you're losing most of the gain to plumbing.
3:52Juniper: And the SIA question is the embarrassingly obvious one: what if you let an AI do both? Not sequentially, not on a fixed schedule — but in a single loop where a meta-agent watches what's happening and decides, at each step, which knob to turn.
4:08Tyler: The analogy that's been useful for me here is the new-hire one. You've got someone starting a specialized job. You can help them two completely different ways. You can give them better tools and procedures — a clearer checklist, a better template, a senior colleague to escalate to. Or you can send them to training where they actually learn the domain. Both make them better. But they're different kinds of better, and they unlock different ceilings. The harness is the checklist and the tools. The weights are the training. A great checklist can't teach domain intuition. Deep domain knowledge can't substitute for a workflow that catches edge cases. And once you see it that way, the question "which one should we automate" stops making sense. The answer is obviously both — you just need a system that can tell which the agent needs right now.
5:05Juniper: That's exactly the move SIA makes. So let me walk through the architecture, because it's simpler than you'd think. There are three roles, all played by language models. A Meta-Agent that takes the task description and writes version one of the scaffold — basically, here's the initial agent. Then the task-specific agent itself, which is that scaffold wrapped around a base model — they use gpt-oss-120b — and it actually runs against the dataset and produces what they call a trajectory. Every prompt, every model response, every tool call, every tool result, every final answer. The whole record. And then the third role, the Feedback-Agent, which is the brain of the loop. It reads the source code of the current scaffold, reads the full trajectory, looks at the performance metrics, and decides what to do next. It has two actions available. It can rewrite the scaffold — emit a new version of the agent's code. Or it can trigger a weight update — actually retrain the model on what just happened.
6:13Tyler: And when it picks weight update, there's a second decision: which RL algorithm to use. The paper gives the Feedback-Agent a menu of about half a dozen of them, and it picks one based on what the reward landscape looks like. We'll come back to that — it's one of the more interesting pieces.
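The loop described in the last two turns can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: every name here (`Action`, `Trajectory`, `AgentState`, `run_loop`, and the callables passed in) is hypothetical. In SIA the Feedback-Agent is a language model; in the sketch it is any callable that reads the scaffold source and the trajectory and returns one of the two actions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Action(Enum):
    """The two levers available to the feedback agent (hypothetical names)."""
    REWRITE_SCAFFOLD = "rewrite_scaffold"
    UPDATE_WEIGHTS = "update_weights"


@dataclass
class Trajectory:
    """Record of one run: prompts, responses, tool calls, results, answers."""
    events: List[str] = field(default_factory=list)
    metric: float = 0.0


@dataclass
class AgentState:
    scaffold_source: str  # code wrapped around the base model
    weights_version: int  # stand-in for a model checkpoint id


def run_loop(
    state: AgentState,
    run_agent: Callable[[AgentState], Trajectory],
    feedback_agent: Callable[[str, Trajectory], Action],
    rewrite_scaffold: Callable[[AgentState, Trajectory], str],
    steps: int,
) -> List[Action]:
    """Run the agent, let the feedback agent pick a lever, apply it, repeat."""
    chosen: List[Action] = []
    for _ in range(steps):
        trajectory = run_agent(state)
        action = feedback_agent(state.scaffold_source, trajectory)
        if action is Action.REWRITE_SCAFFOLD:
            state.scaffold_source = rewrite_scaffold(state, trajectory)
        else:
            # Placeholder: a real system would launch an RL training job here.
            state.weights_version += 1
        chosen.append(action)
    return chosen
```

The point of the shape is that the feedback agent receives the whole trajectory, not just `metric`, which is the design choice the hosts highlight next.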
6:32Juniper: The thing I want to flag first, though, is the design choice of giving the Feedback-Agent the full trajectory rather than just the metrics. That matters. It's not reacting to a falling accuracy number — it's reading what actually happened. It can say: the parser is dropping outputs that end in a code block. It can say: rollouts that compile but exceed shared memory are getting zero reward and that's dragging the gradient signal down. It diagnoses specific failure modes the way a human engineer would, and then chooses an intervention that matches the diagnosis.
7:09Tyler: It's the coach watching practice footage. Sometimes what you need is a tactical adjustment — different stance, different grip, run the play differently. That's a scaffold edit. Sometimes what you need is conditioning — build the underlying capacity that no tactic can substitute for. That's a weight update. A good coach watches the footage and diagnoses which kind of intervention is going to help right now. The Feedback-Agent is doing that, except its "footage" is the structured trajectory of an AI agent solving a task, and its toolkit is either "rewrite the code" or "retrain the model."
7:48Juniper: So that's the loop. Now — does it actually pay off?
7:51Tyler: The three benchmarks they pick are deliberately chosen for contrast. They want to make the case that this isn't a trick that works on one kind of problem. So they grab three genuinely different domains, each with a different shape of "what does expertise look like." Juniper, do you want to walk through them?
8:12Juniper: Sure. First one is LawBench. It's a Chinese legal classification task — you read a case description, you pick which of one hundred ninety-one specific criminal charges applies. Random guessing gets you under one percent. The base model running through a minimal scaffold gets thirteen and a half percent. Previous state of the art on this benchmark was forty-five percent.
8:36Tyler: So the prior SOTA is already doing something nontrivial.
8:40Juniper: Right. And here's the arc. They turn on SIA with just the harness lever — scaffold updates only, no weight changes. The Feedback-Agent watches the trajectory and starts building, on its own, a fairly classical machine learning pipeline around the model. It puts a text-classifier on top, tunes the feature extraction, layers in some preprocessing. Gets to fifty percent. Already past the prior best, just from scaffold iteration. Then progress stalls. The Feedback-Agent switches modes — flips to weight updates. It picks an RL algorithm called GRPO, which we'll come back to, and applies direct gradient pressure to the model on the actual classification task. Adds another twenty points. Final number: seventy point one percent.
9:29Tyler: And the interesting thing about that twenty-point jump is what it represents. The base model, even wrapped in a beautifully tuned scaffold, was confusing fine-grained categories — subtypes of theft, grades of assault, variants of fraud. Those are distinctions you can't engineer your way around with better prompting. Either the model has internalized the difference between two adjacent legal charge categories or it hasn't. The scaffold can't teach it. The weight update can.
10:00Juniper: Second benchmark is much grittier. Writing a custom CUDA kernel — low-level GPU code — for a specific operation in AlphaFold 2, running on H100 hardware. This is a problem where you have to know things about how the chip's memory hierarchy works. Shared memory tiling, register-level accumulation, block size selection. Hardware-specific patterns.
10:24Tyler: And the base model going in, with a minimal scaffold, produces a kernel that runs. But not fast. They have a baseline runtime. After scaffold iteration alone, the Feedback-Agent gets the runtime down a modest amount, call it a fourteen percent speedup. That's the harness ceiling. Then weight updates kick in. The final kernel is roughly fourteen times faster than the original baseline. About a ninety-two percent reduction in runtime relative to where the harness alone could get. And the paper's read on what happened is that the model internalized H100-specific patterns. The kind of knowledge that lives in a senior CUDA engineer's head, a feel for the chip, got written into the weights.
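The three CUDA figures quoted above are consistent with each other, which a short calculation can confirm. This sketch assumes both speedups are speed ratios measured against the same original baseline; the variable names are illustrative and the inputs are the rounded numbers from the episode, not values taken from the paper's tables.

```python
# Assumption: both figures are speed ratios against the same original baseline.
harness_speedup = 1.14  # "a fourteen percent speedup" from scaffold edits alone
final_speedup = 14.0    # "roughly fourteen times faster" after weight updates

# Runtime scales as 1 / speedup, so divide to compare final vs harness runtime.
runtime_vs_harness = harness_speedup / final_speedup
reduction_vs_harness = 1.0 - runtime_vs_harness

print(f"runtime reduction vs harness-only: {reduction_vs_harness:.1%}")
# prints: runtime reduction vs harness-only: 91.9%
```

That lands on the "about ninety-two percent" quoted in the episode.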
11:11Juniper: Tyler, this is the one that I think makes the mechanism distinction most legible to someone outside the field. Because you can squint at the LawBench result and say, well, maybe the scaffold could have gotten further with more iterations. But the CUDA result is harder to wave away. Those hardware-specific micro-optimizations aren't going to fall out of a prompt rewrite. They have to be learned.
11:39Tyler: Right. And then the third domain, which is the one we opened with — single-cell RNA-seq denoising. You measure which genes are active in individual cells, the measurements are extremely noisy because the sampling process is sparse, and you want to recover the true activity levels. They use an existing denoising algorithm called MAGIC. It's hyperparameter-heavy. The harness loop spends iteration after iteration sweeping those hyperparameters — k, t, alpha. Tuning them. Tuning them again. Plateaus at a certain error level. Then the first weight update — first one — adds two lines of post-processing. Clip negative values. Round to integers. Twenty percent improvement on top of the harness ceiling.
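The two-line fix is small enough to write out. This is a sketch of the idea in plain Python, not the code the agent produced: `postprocess_counts` is a hypothetical name, and the input is a made-up matrix standing in for denoised expression values. A real pipeline would more likely apply the same two steps to a NumPy array.

```python
from typing import List


def postprocess_counts(denoised: List[List[float]]) -> List[List[int]]:
    """Force denoised expression values back onto valid molecule counts."""
    return [
        # Step 1: clip negatives to zero. Step 2: round to the nearest integer.
        [round(max(0.0, value)) for value in row]
        for row in denoised
    ]


# Made-up values for illustration only.
denoised = [[-0.3, 1.6], [2.2, 0.4]]
print(postprocess_counts(denoised))
# prints: [[0, 2], [2, 0]]
```

Neither step depends on the denoiser's hyperparameters, which fits the episode's point that a sweep over k, t, and alpha would not stumble onto it.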
12:27Juniper: And the chef analogy that's been running in my head with this one is — imagine a chef who's been tweaking their plating, their garnish, their sauce reduction for weeks, trying to get a dish right. And then they finally taste a competitor's version and realize the rice has to be rinsed first. Not plated differently, not garnished differently. Rinsed. It's a one-step prep rule that no amount of plating refinement could ever discover, because it lives at a different layer of the problem. The harness was rearranging the plate. The weight update rinsed the rice.
13:05Tyler: That's the cleanest formulation I've heard for what the paper is actually claiming. And it generalizes. The harness shapes how the agent searches; weight updates change what the model knows. That sentence is essentially the spine of the paper.
13:22Juniper: Okay — I want to spend a few minutes on the RL algorithm selection piece, because it's one of the genuinely novel things SIA does, and we should at least gesture at it. Tyler, you want to take this?
13:35Tyler: Sure. So when the Feedback-Agent decides to do a weight update, it doesn't just run one fixed RL recipe. It picks an algorithm from a menu. The paper lists about six. I'm not going to walk through all of them — that way lies a graduate seminar — but the principle is what matters. Different RL algorithms make different assumptions about what the reward landscape looks like. And the Feedback-Agent's job is to look at the trajectory and match the algorithm to the landscape. Two examples, just to make it concrete. The one that ran on LawBench is called GRPO. The fishing version of GRPO is: imagine a lake teeming with fish. You can throw out a hundred lines at once, see what you catch, and reward whichever attempts beat the group average. That works when rollouts are cheap, the verifier scores cleanly at the end, and successes aren't rare. Classification is a perfect fit — generate a bunch of candidate answers, score them all, push the model toward the ones that beat the pack.
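"Reward whichever attempts beat the group average" corresponds to GRPO's group-relative advantage: each rollout's reward is centered on the mean of its group and scaled by the group's standard deviation. The sketch below shows only that normalization step. The function name, the `eps` guard, and the choice of population standard deviation are assumptions made for the illustration, and the rewards are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each rollout relative to the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate answers to one classification prompt: 1.0 = correct charge.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Correct answers get a positive advantage and incorrect ones a negative advantage. If every rollout in the group scores the same, all advantages are zero and the group contributes no learning signal, which is why the method suits tasks where successes are not rare.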
14:37Juniper: And the contrast case?
14:39Tyler: The contrast is what happens when you're after a single rare fish in a huge lake. Casting a thousand lines and averaging the catch is useless: almost everything is zero, and the average is basically zero. You have to study the rare success carefully and weight it heavily. That's the problem in tasks where most attempts fail outright, like hard proofs or low-pass-rate code synthesis. The algorithm they use there is called entropic advantage weighting, and the one-sentence version is: exponentially up-weight the rare wins so they don't get buried in the average. Different reward shape, different algorithm. And the paper's claim is that the Feedback-Agent is reading the reward histogram, the rollout cost, and the variance, and picking the algorithm whose assumptions match. The Feedback-Agent is itself an LLM reasoning about which technique fits the situation.
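The "exponentially up-weight the rare wins" idea can be illustrated with a softmax over rewards. This is an assumption-laden sketch of the general technique, not the paper's formula for entropic advantage weighting, which the episode does not spell out. The function name, the `temperature` parameter, and the rewards are all hypothetical.

```python
import math
from typing import List


def exponential_weights(rewards: List[float], temperature: float = 0.1) -> List[float]:
    """Weight each rollout in proportion to exp(reward / temperature)."""
    top = max(rewards)
    # Subtracting the max keeps exp() from overflowing; the ratios are unchanged.
    unnormalized = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Nine failed rollouts and one rare success (made-up rewards).
rewards = [0.0] * 9 + [1.0]
weights = exponential_weights(rewards)
print(f"weight on the single success: {weights[-1]:.4f}")
```

Under a plain average the one success would carry a tenth of the weight. With exponential weighting at a low temperature it carries nearly all of it, so the rare win drives the update.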
15:33Juniper: That move — delegating not just "should we update weights" but "which method should we use to update weights" — is the part that feels like a real new layer of automation. Because if you've ever set up an RL pipeline by hand, you know that algorithm choice is one of the things humans are spending real time on. And SIA is saying: an LLM can read the trajectory and make that call.
15:58Tyler: With one big caveat. The paper shows the Feedback-Agent making these choices and getting good results. It doesn't isolate how much of the gain comes from algorithm selection specifically. We don't have an ablation that compares forcing it to always use GRPO against letting it pick its preferred algorithm. So we can say the dynamic-selection version works. We can't say with confidence how much the dynamic part is doing. Juniper, that's actually a good segue into the harder questions. Want me to do the skeptic pass?
16:32Juniper: Please.
16:33Tyler: Okay. A few things to push on. The biggest is that the headline ablation in the paper isn't fully symmetric. The Feedback-Agent always starts with scaffold iteration. Runs that until progress stalls. Then switches to weight updates. Which means the experimental story is: scaffold-then-weights beats scaffold-alone. But the stronger claim — the one the paper's framing actually implies — is that dynamic interleaving is the key. That the Feedback-Agent's moment-to-moment choice of which knob to turn is doing real work. And to test that, you'd want a comparison against a much simpler baseline: just iterate the scaffold until convergence, and then run a fixed RL pipeline at the end. That baseline might capture most of the gain. The paper doesn't run it.
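The simpler baseline Tyler describes, scaffold edits until progress stalls and then a switch to weight updates, needs no language model to make the call. A minimal sketch follows, assuming a higher-is-better score. The `window` and `min_gain` thresholds, the function names, and the score histories are all hypothetical.

```python
from typing import List


def plateaued(history: List[float], window: int = 3, min_gain: float = 0.01) -> bool:
    """True when the last `window` scores barely beat the best score before them."""
    if len(history) <= window:
        return False
    best_before = max(history[:-window])
    best_recent = max(history[-window:])
    return best_recent - best_before < min_gain


def fixed_schedule(history: List[float]) -> str:
    """Pick the lever from the score history alone, without reading trajectories."""
    return "update_weights" if plateaued(history) else "rewrite_scaffold"


# Made-up score histories for illustration.
print(fixed_schedule([0.135, 0.30, 0.45, 0.50]))                # still improving
print(fixed_schedule([0.135, 0.30, 0.50, 0.50, 0.501, 0.50]))   # stalled
```

This stateless version could flip back to scaffold edits if scores jump again, whereas a strict two-phase schedule would not. Either way, it is the kind of rule the Feedback-Agent's trajectory reading would have to outperform for the dynamic-interleaving claim to hold.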
17:23Juniper: That's fair. So we know the combined approach beats the harness-only approach. We don't really know whether the cleverness is in the combination or in the dynamic choice.
17:34Tyler: Right. Second thing — the benchmark selection is small and the verifiers are clean. Three tasks. Each one has a deterministic verifier and a clear scalar reward. And the whole framework's elegance depends on having a verifier the Feedback-Agent can both invoke and reason about. Most real-world AI problems don't come with clean verifiers. That's often the hard part. The paper doesn't engage with what happens when your verifier is noisy, gameable, or expensive to run.
18:06Juniper: And the third thing I'd flag is the headline number in the abstract. They report a five hundred and two percent improvement on the denoising task. That's measured against the initial baseline, not against the harness-only ceiling. When you compare against where scaffold iteration actually plateaued, the weight-update contribution is more like twenty percent. Which is still a real, honest result. But the abstract picks the bigger number.
18:35Tyler: Both numbers are real. They just answer different questions. "How much better than nothing" gets you five hundred percent. "How much better than the harness alone" gets you twenty. The second is the mechanism claim. The first is the marketing.
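The same effect, one final score giving two very different percentages depending on the reference point, can be reproduced with the LawBench figures quoted earlier in the episode. The denoising task's raw error values are not given here, so this sketch uses the LawBench accuracies instead. It treats "fifty percent" as exactly 50.0, which is an approximation, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(reference: float, final: float) -> float:
    """Percent improvement of `final` over `reference` (higher is better)."""
    return (final - reference) / reference * 100.0


# LawBench accuracies as quoted in the episode, in percent.
base_model, harness_only, final = 13.5, 50.0, 70.1

print(f"vs initial baseline: {relative_gain(base_model, final):.1f}%")
print(f"vs harness ceiling:  {relative_gain(harness_only, final):.1f}%")
# prints:
# vs initial baseline: 419.3%
# vs harness ceiling:  40.2%
```

Both lines describe the same final model. Only the second isolates what the weight update added on top of scaffold iteration.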
18:51Juniper: And then there's the limitation the authors themselves raise, which I think deserves real airtime because it's subtle and important. They call it coupled co-evolutionary Goodhart.
19:04Tyler: This is the deepest worry in the paper.
19:07Juniper: Yeah. So Goodhart's Law in its standard form: when a measure becomes a target, it ceases to be a good measure. You optimize too hard against a proxy, the proxy stops tracking the thing you actually care about. The standard analysis assumes one optimizer. SIA has two. Both the scaffold search and the weight updates are optimizing against the same verifier. And they're not independent — the scaffold finds configurations that are easy for the current weights to exploit, and the weights train on data that was collected by a scaffold that's about to change. You can imagine these two processes converging on a joint configuration that scores beautifully on the verifier and falls apart under any perturbation. They've reached a fixed point that fits each other rather than fitting the underlying problem.
19:59Tyler: The two-students-quizzing-each-other version. They prep for the exam by quizzing each other. Each one tunes their answers to what the other rewards. Each one rewards what fits their evolving sense of the test. They can converge on a shared model of the exam that scores brilliantly when they quiz each other and falls apart on the real test. Because their shared model has drifted to fit each other, not the underlying material. And the paper doesn't test out-of-distribution robustness. We don't actually know whether SIA's final scaffold-plus-weights combination generalizes, or whether it overfits the verifier in a more sophisticated way than scaffold-only solutions do.
20:42Juniper: To be clear, this is a concern the authors raise themselves. They flag it explicitly. It's not an external attack on the paper — it's an honest acknowledgement that the two-lever architecture introduces a coupled failure mode that the one-lever architectures didn't have.
20:59Tyler: And the right framing is probably: this is a paper proving a possibility, not a paper that's settled the question. They've shown that the combination can reach places neither lever reaches alone. They haven't shown that those places are robust.
21:15Juniper: Right. So pulling back to why this all matters — Tyler, my read is that the contribution here is more a reframing than a result. For the last few years, the field has had two cleanly separated stories about how an AI gets better. The agent story is better scaffolding. The model story is better weights. SIA's claim is that these aren't alternatives. They're complementary. They touch genuinely different change-spaces. And once you accept the synthesis, a bunch of follow-on questions open up.
21:48Tyler: The practical version of "why this matters" is something like — if this generalizes, building specialized AI for a new domain stops being a months-long process of hiring engineers to prompt-tune and then ML researchers to design an RL pipeline. It starts looking more like: specify the task, write a verifier, let the loop run. That's a big "if." The three demonstrated domains all have clean verifiers and well-defined tasks. The hard cases in the real world are exactly the ones where the verifier is the problem. But the picture the paper paints, if you take it seriously, is one where the human role keeps moving up the stack. You're not writing the scaffold. You're not writing the RL pipeline. You're not picking the algorithm. You're writing the task spec and the verifier.
22:39Juniper: And the honest version is — there are still humans all over this paper. Humans wrote the Feedback-Agent's prompts. Humans picked the base model. Humans defined the verifiers, picked the benchmarks, decided when to call it done. The paper removes humans from a new layer. It doesn't remove humans from the loop.
22:59Tyler: But it removes them from a layer that, until pretty recently, no one thought you could automate. The choice between "rewrite the agent" and "retrain the model" — that has been a human research-engineering judgment call. SIA is one of the first concrete demonstrations that an LLM can make that call, watch what happens, and iterate on its own decision. Whether that's the start of a new default in the field or just one good idea among several is genuinely open. The bibliography is unusually thin on independent replication — several of the works it cites as precedents are dated this year, concurrent rather than well-established. So this is an early paper in a fast-moving area. Treat it as a credible proof of concept, not a settled result.
23:48Juniper: That feels like the right place to land. The one image I want listeners to walk away with is the rice. The harness was rearranging the plate. The weight update rinsed the rice. Two trivial lines of code — clip and round — that no amount of scaffold iteration was ever going to produce, because the fix lived at a different layer of the problem. And the paper's bet is that a lot of real problems have rice that needs rinsing somewhere in them. Layers where the fix isn't a procedure or a prompt — it's a thing the model has to know. If that's right, then any self-improvement loop that touches only one lever is leaving the other ceiling on the table.
24:30Tyler: That's the spine of it. Harness updates shape how the agent searches. Weight updates change what the model knows. Get both, get places neither could reach.
24:40Juniper: The paper's linked in the show notes, along with some further reading if you want to go deeper on either of the two threads — the scaffold-evolution side or the test-time-training side. Both have rich literatures that this paper is trying to bridge.
24:56Tyler: And if you want the full transcript with definitions inline, plus the concept pages that tie this episode to the other work on agents and self-improvement we've covered, that's all on paperdive.ai.
25:09Juniper: Thanks for listening to AI Papers: A Deep Dive.