Episode 193 · Jul 02, 2026 · 22 min

Freeze Most of the Network: Where RL Improvement Actually Lives in a Transformer

Zhang, Hu, Glentis et al.

LLM Post-training

AI Papers: A Deep Dive

Concepts in this episode

Training Methods · Mechanistic Interpretability · AI Efficiency & Cost · RL Post-Training · GRPO · Math Benchmarks · Ablation Studies · Scaling Laws · Agentic Coding · LoRA · Credit Assignment · Parallel Sampling

About this episode

Paper: Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
Venue: arXiv:2607.01232
Year: 2026
Read the paper: arxiv.org/abs/2607.01232

Train just ten layers of a 36-layer model with reinforcement learning and you beat training all 36 — because the improvement doesn't spread across the network, it concentrates in a handful of middle layers. This episode traces where, physically, RL adaptation lands inside a transformer, why a zero-cost 'just train the middle' heuristic beats the standard recipe on math, and where the headline overreaches the evidence.

What you'll take away

Why RL improvement concentrates in a small set of middle layers rather than spreading evenly — a clean inverted-U across 36 floors that repeats across seven models, two families, three algorithms, and three task domains
The 'door' dissociation: middle layers matter not because they move more (weight change is roughly uniform) but because of leverage — the quality of a layer's parameter subspace, not the distance it travels
A zero-cost heuristic (train the geometric middle layers by position alone, no profiling) that beats full-parameter training on math, with a margin over the full baseline worth ~21% of the total RL gain
That the important layers are fixed during pretraining and portable across tasks (rankings correlate ~0.59 across math and code), so RL just moves into a room that was already built
The steelman critique: 'one layer is enough' is softer than the title — many single-layer wins sit at the edge of noise, and the training strategies were only validated on math
Why a panel of seven layer-specialists (34% answer overlap) beats sampling one model seven times — structural diversity over sampling diversity

Chapters

00:00 Fewer moving parts, better score?
01:46 What RL is actually sharpening
03:06 One floor at a time
04:23 A ruler for the hill climb
06:03 The hump in the middle
08:01 Is the finding real, or lucky?
09:41 The same floors, a different job
11:15 Distance or leverage?
13:46 Skip the scan, train the middle
17:08 How much of this really holds?
20:15 Remember the door

References in this episode

LoRA: Low-Rank Adaptation of Large Language Models — The canonical parameter-efficient fine-tuning method that this episode's 'train
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Introduces GRPO, the exact RL algorithm the episode uses to sharpen math perform
The Unreasonable Ineffectiveness of the Deeper Layers — A companion perspective on where capability lives in the transformer stack, show
Self-Consistency Improves Chain of Thought Reasoning in Language Models — The sampling-diversity majority-vote baseline the episode's 'panel of layer-spec

Full transcript

0:00Juniper: Here's a result that reads like a typo. Take a language model with 36 layers. Instead of fine-tuning all of them with reinforcement learning — the way basically everyone does it — freeze most of the network and train just ten. On a suite of math benchmarks, that stripped-down version scored 69.1 percent. Train the full model, all 36 layers, and you get 66.4. Fewer moving parts, better score.

0:26Eric: Quick heads up before we get into it — this is an AI-made explainer, both voices included.

0:32Juniper: And by the end of this you'll understand where, physically, inside one of these networks the improvement from reinforcement learning actually lives — because it turns out it's not smeared across the whole thing. It's concentrated in a handful of layers in the middle. So concentrated that training one well-chosen middle layer can recover most of what you'd get from retraining everything.

0:57Eric: Which is strange, right? Because the whole premise of full-parameter RL is that improvement is a team effort. Every layer pitches in a little, you nudge all of them together, and the model gets better as a coordinated whole. That's the assumption baked into the default recipe. Nobody had actually checked whether it was true.

1:19Juniper: And this matters beyond the curiosity, because RL post-training is one of the most expensive, most finicky stages in building a modern model. If most of the payoff comes from a small, predictable set of layers, then a lot of that compute is being spent moving parameters that barely matter — and you could stop.

1:39Eric: So let's set the stage properly. Picture a transformer as a vertical stack of identical floors. Text comes in at the bottom, gets transformed step by step as it rises, and the finished answer comes out the top. The model anchoring most of this paper has 36 of those floors. Each floor has its own set of adjustable numbers — its parameters.

2:01Juniper: And the model is built in two big phases. Pretraining comes first — it reads an enormous amount of text and learns to predict what comes next, which is where it gets fluent language and broad knowledge. Then post-training, where reinforcement learning polishes that fluent model toward things you actually want — correct math, working code, following instructions. RL isn't teaching it to talk. It's the final sharpening of a skill that's already there.

2:31Eric: The method they use for that sharpening is GRPO, and you don't need the equations. Here's the whole idea: for each question, the model generates a small batch of its own answers, those answers get scored right or wrong, and the model gets nudged toward the ones that beat the batch average. No separate judge network — it just races several of its own guesses against each other and leans toward the winners.
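The batch-relative scoring Eric describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. The function name `group_relative_advantages` is hypothetical, and dividing by the group's standard deviation is an assumption based on how GRPO is commonly written down; the episode itself only mentions comparing each answer against the batch average.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Score each sampled answer relative to its own group.

    rewards: one number per sampled answer to the same question
             (for example 1.0 for correct, 0.0 for wrong).
    Returns one advantage per answer: positive means "better than the
    group average", negative means "worse".
    """
    avg = mean(rewards)
    spread = pstdev(rewards)  # assumption: normalise by the group's std
    if spread == 0:
        # Every answer scored the same, so there is nothing to prefer.
        return [0.0 for _ in rewards]
    return [(r - avg) / spread for r in rewards]


# Four sampled answers to one question: two right, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The correct answers end up with positive advantages and the wrong ones with negative advantages, which is the "lean toward the winners" signal. When every answer in the group gets the same score, the question contributes no preference at all.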

2:58Juniper: So the question the paper asks is almost anatomical. When you run that process and the model gets better at math — which floor in the stack absorbed the improvement? And to answer it, they do something clean. For a model with 36 layers, they train each layer, one at a time, all by itself. Freeze the other 35, unfreeze one, run the exact same RL, measure the result. Then do it again for the next layer. Thirty-six separate training runs, one per floor.

3:29Eric: And there's a subtle point in that setup that's easy to miss but it's load-bearing. When you freeze all but one layer, the learning signal still flows backward through the entire network. That's just how backpropagation works — the error at the output gets passed back down through every floor. So the frozen layers aren't asleep. They're still shaping the feedback. You're just refusing to update their numbers.

3:57Juniper: Right — the whole orchestra plays and listens. Only one musician is allowed to adjust their part. So the real question each layer is being asked is: given the full network's feedback, how much of the total improvement can your parameters alone absorb?
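The "freeze 35, unfreeze one" setup comes down to deciding, per parameter, whether it may be updated. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the parameter naming scheme (`model.layers.<index>.`) is assumed rather than taken from the paper, and keeping the embeddings and output head frozen is also an assumption.

```python
import re

# Assumption: parameters are named like "model.layers.12.mlp.up_proj.weight",
# i.e. the layer index follows a "layers." segment. Adjust for other models.
LAYER_PATTERN = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layer_index(param_name):
    """Return the layer index embedded in a parameter name, or None."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def trainable_mask(param_names, train_layer):
    """Map each parameter name to True (update it) or False (keep frozen).

    Only parameters inside `train_layer` are trainable. Everything else
    stays frozen, although gradients still flow through the frozen
    layers during backpropagation.
    """
    return {name: layer_index(name) == train_layer for name in param_names}


names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.17.self_attn.q_proj.weight",
    "model.layers.17.mlp.up_proj.weight",
    "lm_head.weight",
]
mask = trainable_mask(names, train_layer=17)
```

In a PyTorch training script, a mask like this would typically be applied by setting `requires_grad` on each parameter while iterating over `model.named_parameters()`. A full scan repeats the run once per layer index.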

4:15Eric: And to score that, they need a ruler. This is the one piece of math worth holding onto, and it's genuinely simple. Think of it as a hill climb. Set two markers. The bottom is the model's score before any RL. The top is its score after full RL — everything trained. Now train a single layer and see where it lands on that slope.

4:37Juniper: So the contribution of a layer is just: what fraction of the climb did it manage on its own? Halfway up the hill is a contribution of one-half. Reaching the top — matching full training — is 1.0. Overshooting the top, actually beating full training, is above 1.0. And here's the one that stings: if training a layer drags the model below where it started, that's a negative number. It walked downhill.
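The hill-climb ruler is a single ratio: how far one trained layer gets, divided by how far full training gets. Here is a minimal sketch of that ratio. The function name and the example scores are hypothetical and chosen only to show the four cases Juniper lists.

```python
def contribution(layer_score, base_score, full_score):
    """Fraction of the base-to-full climb reached by one trained layer."""
    climb = full_score - base_score
    if climb == 0:
        raise ValueError("full RL training did not change the score")
    return (layer_score - base_score) / climb


# Hypothetical scores, chosen only to show the four cases.
base, full = 50.0, 60.0
halfway = contribution(55.0, base, full)    # half the climb
matched = contribution(60.0, base, full)    # same as full training
overshoot = contribution(62.0, base, full)  # beats full training
downhill = contribution(45.0, base, full)   # worse than the base model
```

Because the denominator is the total RL gain, the metric gets noisy when that gain is small, which is relevant to the critique later in the episode.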

5:05Eric: And every single one of those cases actually shows up in the data.

5:11Juniper: They do. On the smallest model they tested, one middle layer hit a contribution of 1.14 — so it edged past full training by itself. Meanwhile another layer, further up, managed only 0.28 — barely a quarter of the climb. That's a four-to-one spread between the best and worst floor in the same network.

5:32Eric: And the negative one is the vivid proof. On the 8-billion model, the very first layer — layer zero, right at the input — came in at minus 0.51. Train that layer in isolation and math performance dropped below the untrained base model. It didn't just fail to help. It actively made things worse.

5:52Juniper: So already the tidy story — every layer pitches in equally — is dead. But the part that turns this from a curiosity into a finding is what happens when you plot it. And this is the image to hold in your head for the rest of the video.

6:09Eric: Which is what, exactly? Because "some layers matter more than others" on its own could just be noise — you'd expect some scatter.

6:18Juniper: It's not scatter. Picture that 36-story building again, and plot each floor by how much learning happens there. The lobby barely moves the needle. The penthouse barely moves the needle. But the middle floors light up — contribution climbs as you go up from the bottom, peaks somewhere around 40 to 60 percent of the way up, then falls back down toward the top. It's a clean hump, an inverted-U, peaking dead center. And a few of those middle floors poke right above the line that marks full-training performance.

6:52Eric: And the reason that's convincing rather than a fluke is that they didn't see it once. They saw the same hump across seven different models, two model families, three different RL algorithms, and three completely different task domains — math, code, and interactive agent tasks. Same shape every time. Every model had at least one layer at or above full-training contribution, and every model had that middle concentration.

7:20Juniper: Now, I want to flag something here, because a careful person is already forming the objection. When I say a few layers "poke above the full-training line," a lot of those are just barely above — 1.01, 1.02. The truly dramatic single-layer wins are real, but some of them sit close to the noise. The robust version of the claim is that a single well-chosen layer recovers most of the climb. That gap between "most" and "beats" matters, and we'll come back to it.

7:50Eric: Good — because that's exactly where a skeptic goes first. My immediate reaction to "one layer matched full training" isn't excitement, it's suspicion. Maybe the single-layer runs got a lucky learning rate. Or maybe they hobbled the full baseline to make the comparison look good.

8:10Juniper: And the way they handle that is one of the strongest parts of the paper. They tuned the learning rate to make the full-parameter baseline as good as it could possibly be. Then they forced every single-layer run to use that same rate. So the deck is stacked in favor of full training — no individual layer got a custom setting to help it shine.

8:33Eric: And they went further, because the obvious follow-up is "well, maybe the weak layers just needed a bigger learning rate to wake up." So they retrained the best and worst layers at three times the rate. The rankings didn't budge — the worst layers' contribution moved by at most two-hundredths. You can't rescue a weak layer by pushing harder on it. Its position in the ranking is real.

8:59Juniper: The other thing they checked is whether the improvement was even genuine — or just overfitting to the math training set. So they took each layer-trained model and tested it on out-of-domain stuff: code, science reasoning, language. And the layers that were good at absorbing math RL were also the ones that improved broadly. The correlation was above 0.6. So a strong middle layer is capturing real capability, not memorizing the training questions.

9:29Eric: Which sets up what I think is the deepest empirical claim in the paper — deeper than "the middle matters." The layer rankings are portable.

9:38Juniper: Meaning the same floors stay important even when you change the job?

9:43Eric: Exactly that. They compared the ranking of layers from a math dataset against the ranking from a coding dataset, a different task entirely. Those rankings correlate at 0.59. Two different math datasets correlate at 0.76. On a scale where 1.0 means identical rankings and zero means no relationship, that's the same layers staying important as you swap the data underneath them.
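To make "the rankings correlate" concrete, here is a small sketch of a rank correlation between two sets of per-layer scores. It assumes a Spearman-style coefficient; the episode does not say which coefficient the paper reports. The function names and the example scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving ties their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def rank_correlation(scores_a, scores_b):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same layers")
    ranks_a, ranks_b = average_ranks(scores_a), average_ranks(scores_b)
    n = len(ranks_a)
    mean_a, mean_b = sum(ranks_a) / n, sum(ranks_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ranks_a, ranks_b))
    var_a = sum((a - mean_a) ** 2 for a in ranks_a)
    var_b = sum((b - mean_b) ** 2 for b in ranks_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a ranking with no variation has no correlation")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical per-layer contributions on two tasks (five layers).
math_scores = [0.1, 0.6, 1.0, 0.7, 0.2]
code_scores = [0.2, 0.5, 0.9, 0.8, 0.0]
rho = rank_correlation(math_scores, code_scores)
```

A value near 1.0 means the two tasks order the layers almost identically, a value near zero means the orderings are unrelated, and a negative value means they tend to be reversed.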

10:05Juniper: And that's a bigger deal than it sounds, because it says the important layers aren't chosen by the task. They're baked in during pretraining. RL isn't deciding which layers should carry the load — it's just finding the layers that were already primed to adapt, before any post-training happened. It lines up with earlier work showing that the critical layers for math reasoning get set during pretraining and stay put afterward.

10:34Eric: So the geography is fixed before RL ever shows up. RL just moves into the neighborhood that was already built for it.

10:42Juniper: Okay. So here's where the story could go wrong in the most boring possible way — and the paper spends its sharpest section making sure it doesn't. This is the part that decides whether the whole finding is deep or trivial, and it pays off in a single dissociation you can picture with a door.

11:02Eric: Because the lazy explanation is right there. Middle layers matter most — obviously, because middle layers change the most during training. Right? They move further, so they do more of the work. If that's true, the finding is kind of boring. Of course the parts that move the most matter the most.

11:23Juniper: So they measured it. During full training, how far do each layer's parameters actually travel? And the answer is: they all move by about the same amount. The weight change is close to uniform across the whole stack; the middle layers do not move more than the rest. Roughly the same distance, top to bottom.

11:44Eric: And yet the middle layers contribute far more. Same movement, wildly different payoff. That's the whole thing.

11:52Juniper: And they nailed it down even harder. When you force a single layer to absorb all the improvement, the high-contribution layers and the low-contribution layers move through parameter space by similar amounts — but produce completely different results. So it's not distance. It's not effort.
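One way to measure "how far a layer travels" is the size of its weight change relative to the size of its original weights. The sketch below assumes a relative L2 distance, since the episode does not say which norm the paper uses. The function names and toy weights are hypothetical.

```python
import math


def relative_change(before, after):
    """Relative L2 distance between two flat lists of weights.

    Assumption: ||after - before|| / ||before|| is one common choice of
    distance. `before` must not be all zeros.
    """
    moved = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    size = math.sqrt(sum(b ** 2 for b in before))
    return moved / size


def per_layer_change(weights_before, weights_after):
    """Map layer index -> relative change, given {layer: flat weight list}."""
    return {
        layer: relative_change(weights_before[layer], weights_after[layer])
        for layer in weights_before
    }


# Toy example: two layers that move the same relative distance.
before = {0: [3.0, 4.0], 1: [0.0, 5.0]}
after = {0: [3.0, 4.5], 1: [0.5, 5.0]}
change = per_layer_change(before, after)
```

The dissociation in the episode is that a profile like this comes out roughly flat across the stack, while the contribution profile for the same layers is strongly humped.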

12:11Eric: This is the door. Imagine shoving a heavy door open. Push near the hinge with all your strength and almost nothing happens. Push near the handle with the same force and it swings wide. Same effort — the result depends entirely on where you apply it. Leverage, not force.

12:29Juniper: The middle layers are the handle. And the thing the paper is measuring, in their words, is the quality of a layer's parameter subspace for capturing RL improvement — not how far its weights travel. It's about where you push, not how hard.

12:44Eric: And once you accept that, the negative layer stops being weird. Layer zero moved just as much as any other layer — it just moved in a direction that made the model worse. It's pushing on the hinge and somehow bending the frame.

12:59Juniper: So let me consolidate where we are, because we've made a few moves. The improvement from RL isn't spread evenly — it lives in a small set of middle layers. That pattern holds across models, algorithms, and tasks. It's stable enough to be a property of the pretrained network, not the task. And it's not explained by which layers move the most — it's about which layers have the leverage. All of which sets up the obvious question: if you know that, can you exploit it?

13:29Eric: And this is where it turns from an observation into a recipe. They tried three strategies, escalating in cleverness. The first two both amount to "pay attention to the good layers" — give the high-contribution layers a bigger learning rate, or just freeze everything else and train only the top few. Both beat standard full-parameter training.

13:51Juniper: But the third one is the sleeper, and it's the one worth remembering. Skip the profiling entirely. Don't measure anything. Don't run the expensive per-layer scan. Just train the geometric middle layers — the ones in the physical center of the stack, chosen by position alone.

14:09Eric: Wait — no measurement at all? You just point at the middle floors and train those?

14:15Juniper: That's it. And on all three of the Qwen3 models, it beat full-parameter training. On the 8-billion model, just training the middle by position gained about 1.76 points over the full baseline — which is 21 percent of the total improvement RL bought in the first place — for free. No profiling, no extra runs, no cost. You'd have trained fewer parameters and come out ahead.

14:41Eric: And that's the practical prize, because that per-layer scan is not cheap — each training run was eight top-end GPUs for about four hours, and a full scan means dozens of them. The whole point of the position heuristic is you skip all of that and still win.

15:00Juniper: And going back to where we opened — if you do spend the effort and pick the ten best layers on the 8-billion model, that's the 69.1 versus 66.4. So whether you profile carefully or just crudely grab the middle, training fewer of the right layers beats training all of them.
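The two selection rules discussed here, "pick the best layers from a scan" and "just grab the middle", can be sketched as follows. Both function names are hypothetical. The ten-layer window for the position heuristic is an assumption made for illustration: the episode quotes ten layers for the profiled selection but does not say how many layers the position heuristic trains or exactly how the center is defined.

```python
def top_k_layers(contributions, k):
    """Indices of the k layers with the highest profiled contribution."""
    ranked = sorted(range(len(contributions)),
                    key=lambda i: contributions[i], reverse=True)
    return sorted(ranked[:k])


def middle_layers(num_layers, k):
    """Indices of the k layers closest to the center of the stack.

    Uses position only, so no per-layer profiling runs are needed.
    """
    if not 0 < k <= num_layers:
        raise ValueError("k must be between 1 and num_layers")
    start = (num_layers - k) // 2
    return list(range(start, start + k))


# A 36-layer stack with a 10-layer window (the window size is an assumption).
window = middle_layers(36, 10)  # layers 13 through 22

# Hypothetical scan results for a 4-layer toy model.
best_two = top_k_layers([0.1, 0.9, 0.5, 1.1], k=2)
```

The first rule needs one training run per layer before it can return anything; the second needs only the layer count, which is why the episode calls it zero-cost.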

15:19Eric: There's one more result I think is worth a beat, because it's the strangest one. They took the top seven layer-trained models — each one a model where a different single layer did the learning — and looked at which problems each one actually got right.

15:38Juniper: And they don't overlap much.

15:40Eric: Barely a third. The average overlap between any two of them was about 34 percent. Similar accuracy overall, but they're solving different problems. Each layer-specialist has its own blind spots.
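An overlap figure like this can be computed from the sets of problems each model solved. The sketch below assumes overlap means "problems both solved, divided by problems either solved"; the episode does not give the paper's exact definition. The function names and problem ids are hypothetical.

```python
from itertools import combinations


def overlap(solved_a, solved_b):
    """Share of problems solved by either model that both models solved."""
    union = solved_a | solved_b
    if not union:
        return 0.0
    return len(solved_a & solved_b) / len(union)


def mean_pairwise_overlap(solved_sets):
    """Average overlap across every pair of models."""
    pairs = list(combinations(solved_sets, 2))
    if not pairs:
        raise ValueError("need at least two models to compare")
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


# Hypothetical problem ids solved by three single-layer models.
solved = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 4, 6, 7}]
average = mean_pairwise_overlap(solved)
```

In this toy case every model solves four problems, so accuracy is identical, yet any two models share only a third of what they solve between them.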

15:54Juniper: So you can pool them like a panel of specialists.

15:58Eric: Right — take a majority vote across the seven. On one of the harder benchmarks, that panel hit 33.6 percent, versus 28.3 for the best single layer and 26.9 for the full-parameter baseline. But here's the comparison that actually means something. They also tried the standard trick — take one full model and sample it seven times, then vote. Same model, seven attempts. That got 31.3. The panel of different layers beat it.

16:28Juniper: So structural diversity beats sampling diversity. Seven genuinely different specialists cover more ground than one expert in seven slightly different moods. Though the paper is honest that this one is an analysis tool, not a shipping recipe — you have to train seven separate models to build the panel, which defeats the point of saving compute.
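The voting step works the same way whether the seven answers come from seven different models or from one model sampled seven times. Here is a minimal sketch; the function names and the toy answers are hypothetical, and breaking ties in favor of the earliest answer is an assumption.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the earliest one seen."""
    return Counter(answers).most_common(1)[0][0]


def panel_accuracy(panel_answers, gold):
    """Accuracy of a voting panel.

    panel_answers: one list per model, each holding that model's answer
                   to every problem, in the same problem order as `gold`.
    """
    votes_per_problem = zip(*panel_answers)
    voted = [majority_vote(list(votes)) for votes in votes_per_problem]
    return sum(v == g for v, g in zip(voted, gold)) / len(gold)


# Hypothetical answers from three models on four problems.
gold = ["4", "9", "16", "25"]
panel = [
    ["4", "9", "15", "20"],
    ["4", "8", "16", "21"],
    ["5", "9", "16", "22"],
]
accuracy = panel_accuracy(panel, gold)
```

Each toy model gets two of four problems right on its own, but because their mistakes fall on different problems, the vote gets three of four. That is the effect low overlap buys.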

16:53Eric: And that honesty is a good segue, because I want to push on how much of this actually holds up — not to knock it down, but because the framing is doing some work the numbers don't fully support.

17:08Juniper: Go for it.

17:09Eric: So the title is "Is One Layer Enough?" — and the strong reading is "one layer matches full training." And technically, yes, single layers reach contribution of 1.0 and a bit above. But a lot of those wins are 1.01, 1.02, 1.06 — right at the edge of the noise band. Several are within a standard deviation or two of the baseline. The genuinely robust claim isn't "one layer is enough." It's "one well-chosen layer recovers most of the gain." Still a real result — but softer than the headline.

17:45Juniper: And you'd say the same about the guided-training wins?

17:49Eric: I would. On the 8-billion model, the best guided strategy beats full training by about 2.7 points, and the paper frames that as 32 percent of the total RL gain. Which sounds huge — a third of the benefit. But it sounds huge partly because the total gain is small. Math RL only moved the needle six to ten points to begin with. So a third of a small number is a small number dressed up in a big percentage. The real question a reviewer asks is: does this hold at larger scale, or on harder tasks where full RL actually matters more? We don't know yet.
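The percentage framing can be cross-checked against the point gains quoted elsewhere in the episode. The short calculation below works backward from each "points over baseline" and "share of the total gain" pair. The implied total is an inference from rounded figures, not a number the episode reports, and the function name is hypothetical.

```python
def implied_total_gain(extra_points, share_of_total):
    """Total RL gain implied by 'extra_points is share_of_total of it'."""
    return extra_points / share_of_total


# Figures quoted in the episode (all rounded).
top_ten_margin = 69.1 - 66.4                   # about 2.7 points
from_guided = implied_total_gain(2.7, 0.32)    # about 8.4 points
from_middle = implied_total_gain(1.76, 0.21)   # about 8.4 points
```

Both pairs point to a total RL gain of roughly 8.4 points on the 8-billion model, inside the "six to ten" range Eric mentions, so the two percentages are at least consistent with each other.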

18:29Juniper: That's fair. And the strategies themselves were only validated on math.

18:34Eric: That's the part I'd underline. The beautiful cross-task consistency — the hump showing up in code and in agent tasks — that was demonstrated for the contribution *metric*. The training strategies, the part you'd actually use, were only tested on mathematical reasoning. The authors say so directly. So the analysis is broad, but the payoff is narrow. And on a couple of the seven models — the agentic and distilled ones — they only scanned some of the layers and interpolated the rest, so the full hump is well-supported but not exhaustively mapped everywhere.

19:11Juniper: I'll concede all of that. The precise, defensible version of this paper is quieter than the title: a single well-chosen layer recovers most of the RL gain, the important layers cluster in the middle and are fixed at pretraining, and a zero-cost position heuristic beats full training on math. That's still a genuinely useful, genuinely surprising result. It's just not "one layer, done."

19:38Eric: And the biggest gap is the one they're most honest about — nobody knows *why*. They show the middle matters. They rule out the boring explanation that it just moves more. But there's no mechanistic theory for why the center of the stack is where RL adaptation lands. It's a well-documented pattern, not an understood one. And I actually think that's the most interesting thing to sit with, not a weakness to paper over.

20:06Juniper: Which is a good place to land the real takeaway. Because the method — train the middle, freeze the rest — that's the practical souvenir. But the durable idea is bigger than any recipe. For years the working assumption has been that RL improves a model as a coordinated, whole-network effort. This says the opposite: RL mostly reshapes a small, stable set of layers that were already primed before any post-training started. The network has a fixed functional geography, and RL just moves into the room that was built for it. And the fact that a single layer can sometimes beat the whole network hints that training everything at once might actually be stepping on itself in places.

20:55Eric: So here's the question I'd put to you. If the important layers are set at pretraining and portable across tasks — do you profile a model once, cheaply, and then reuse that layer selection for every RL job you ever run on it? Or is that too fragile to trust, and you keep unfreezing everything just to be safe? If you've actually run RL fine-tuning on a budget, you already have an instinct here — I'd like to know which way it points, so drop it in the comments.

21:28Juniper: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers on layer importance and parameter-efficient tuning, grouped by theme.

21:43Eric: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Juniper and I are both AI voices from ElevenLabs, and the producer isn't affiliated with either company. The paper is "Is One Layer Enough?", posted just yesterday (July first, twenty twenty-six), and we're recording the day after.

22:05Juniper: So the next time someone tells you RL improves a model everywhere at once — remember the door. Same shove, wrong spot, nothing moves. The trick isn't pushing harder. It's knowing where the handle is.