Episode 177 · Jun 26, 2026 · 25 min

Why Raw Profiler Data Made an AI Worse at Writing GPU Code

Gai, Zhang, Bostrom et al.

Systems ML

AI Papers: A Deep Dive

About this episode

Paper: Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

Venue: arXiv:2606.26453

Year: 2026

Read the paper: arxiv.org/abs/2606.26453

Feeding a language model detailed hardware measurements about its GPU code made the code slower than telling it nothing at all — and that counterintuitive result is the foundation for a system that wrote a kernel from scratch beating the human experts who hand-tuned the production version. The fix wasn't more data; it was a deterministic layer that pre-digests measurements into expert-style diagnoses. You'll learn why interpretation beats raw access, and exactly where the headline claims hold up and where they're thinner than they look.

What you'll take away

Why feeding the model raw hardware counters produced slower code (1.8x speedup) than giving it no profiling data at all (3.3x), and why that gap is the paper's most confident result
How KernelPro splits 'reading the profiler' from 'writing the code,' encoding 15 expert heuristics as deterministic tools that output diagnoses, not numbers
Why the SASS disassembly tool caught 37 kernels silently falling back to slow scalar code that no utilization metric could have detected
How the Monte Carlo Tree Search uses log-scaled rewards and a hard correctness wall to avoid being seduced by easy wins on garbage code
The production case study where a from-scratch kernel climbed from 14x slower to 1.23x faster than expert engineers over 18 iterations — and why the skeptic calls it an N-of-one result
Where the claims weaken: speedups measured against unoptimized PyTorch, unfair cross-system comparisons, and a 'headline' search-memory feature that didn't clear significance

Chapters

00:45 How can information make you worse?
01:56 What actually makes a kernel fast?
04:02 The category error everyone was making
07:56 Checking the receipt against the kitchen
10:35 The search that refuses to quit
14:46 Does it actually hold up?
17:26 Beating the humans, once
20:46 Same speed, less power
22:54 Diagnose first, then prescribe

References in this episode

KernelBench: Can LLMs Write Efficient GPU Kernels? — The standard 250-task benchmark across three difficulty tiers that this episode'
Mastering the Game of Go with Deep Neural Networks and Tree Search — The AlphaGo paper that popularized the Monte Carlo Tree Search algorithm KernelP

Full transcript

0:00Juniper: Researchers handed an AI more data about what its GPU code was doing wrong — and the code got worse. Worse than if they'd told it nothing at all. And this isn't a fluke they're hand-waving past. They ran the statistics.

0:16Eric: Quick heads up before we start — this is an AI-made explainer, both voices included.

0:21Juniper: So here's the promise. By the end of this you'll understand why, for one of the hardest crafts in computing, raw measurement data actively makes a language model dumber — and how a system built on that single counterintuitive idea wrote a GPU kernel from scratch that beat the human experts who'd already hand-tuned the production version.

0:44Eric: But let's sit on the weird part first, because it shouldn't be possible. Information isn't supposed to hurt you. Worst case, you ignore it. If I give the model the profiler readout and it doesn't help, fine — but how do you get to actively worse than silence?

1:03Juniper: That's the whole paper, really. The numbers: feed the model the raw hardware counters and you get an average speedup of about one-point-eight times over the baseline. Give it no profiling data at all — just "make this faster" — and you get three-point-three. The version with nothing nearly doubles the version with everything. And the gap is highly statistically significant; this is the result they're most confident in.

1:31Eric: And here's why this matters beyond one paper: writing fast GPU code is the bottleneck under all of modern AI. It takes scarce experts, and every new chip generation reopens the problem. So "can a model do it, and how should we wire it up" is a question with real money behind it. The paper introduces a system called KernelPro, out of Amazon, posted this week.

1:55Juniper: Let me set the table on what a GPU kernel even is, because everything hangs on it. A kernel is a small program that runs in massive parallel across the thousands of little arithmetic units on a graphics card — the code underneath every model's training and inference. Writing one that's correct is easy. Writing one that's fast is brutal, because the chip has a strict pecking order of memory. Registers are instant. Shared memory, on-chip, is fast. The big off-chip pool, DRAM, is slow. A kernel that does the right math but keeps waiting on DRAM can be a hundred times slower than one arranged to keep its data close.

2:36Eric: And the expert's whole job is rearranging code to respect that hierarchy. They run a profiler — the tool that reads out the flood of low-level statistics the chip records while it runs — they spot a bottleneck, they fix it, and then the bottleneck moves somewhere else and they do it again. It's a loop.

2:57Juniper: Right, and here's the observation that cracks the paper open. What makes those experts valuable is not that they can read numbers off a profiler. Anyone can read "occupancy is six percent." Occupancy, by the way, is just how much of the chip's parallel capacity is actually in use — six percent means the silicon is sitting nearly empty. The number is trivial to read. What's scarce is the diagnostic reasoning that turns that number into "here's why, and here's the fix" — heuristics people refine over years and bury in personal scripts.

3:33Eric: So think of it like a blood panel. You hand a smart non-specialist a printout — thirty values, all technically right there in front of them. None of it tells them what to do. An experienced physician doesn't read the numbers one at a time; they instantly fuse "this value plus that value" into "early kidney stress, do X." The diagnosis is the scarce thing, not the data.

3:58Juniper: And that's exactly where the existing LLM kernel systems made what the authors call a category error. They hand the model the raw telemetry and ask it to do two completely different jobs at once — interpret the hardware counters and write creative optimized code, in one shot. But reading counters is a structured, rule-governed task. Writing fast code is open-ended and creative. Jam them together and the model gives inconsistent diagnoses and misses fixes.

4:31Eric: So KernelPro's central bet is to peel those two jobs apart. Do the "read the profiler like an expert" step with deterministic tools that encode the heuristics — and only then let the model see the result. Instead of telling the model "occupancy is six percent," a tool tells it: "occupancy is critically low at six percent because your shared-memory usage caps you at two concurrent blocks — switch to a warp-level reduction, expected three-to-five-times gain."

5:04Juniper: There are fifteen of these things. They call them micro-profiling tools, and each one encodes a single expert heuristic as a rule with three parts: a trigger, an analysis, and a prescription with an expected-improvement estimate. The output is never a number. It's a severity rating, a root cause, and a ranked list of fixes — sometimes with a code snippet attached.
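To make the three-part rule shape concrete, here is a minimal Python sketch of one such tool. It is an illustration only: the class names, the metric keys, and the 25 percent trigger threshold are hypothetical and are not taken from the paper. The diagnosis text mirrors the occupancy example quoted earlier in the episode.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Diagnosis:
    severity: str        # e.g. "critical" or "warning"
    root_cause: str      # plain-language explanation, not a bare number
    fixes: list[str]     # ranked, best first
    expected_gain: str   # rough estimate, e.g. "3-5x"


@dataclass
class MicroTool:
    name: str
    trigger: Callable[[dict], bool]        # part 1: should this rule fire?
    analyze: Callable[[dict], Diagnosis]   # parts 2 and 3: analysis + prescription

    def run(self, metrics: dict) -> Optional[Diagnosis]:
        return self.analyze(metrics) if self.trigger(metrics) else None


def analyze_occupancy(m: dict) -> Diagnosis:
    pct = m["occupancy_pct"]
    blocks = m["blocks_per_sm_limit_smem"]   # hypothetical metric key
    return Diagnosis(
        severity="critical",
        root_cause=(f"occupancy is {pct:.0f}% because shared-memory usage "
                    f"caps each SM at {blocks} concurrent blocks"),
        fixes=["switch to a warp-level reduction"],
        expected_gain="3-5x",
    )


occupancy_tool = MicroTool(
    name="occupancy",
    trigger=lambda m: m["occupancy_pct"] < 25.0,   # hypothetical threshold
    analyze=analyze_occupancy,
)

diag = occupancy_tool.run({"occupancy_pct": 6.0, "blocks_per_sm_limit_smem": 2})
print(diag.severity, "-", diag.root_cause)
```

The point of the structure is that the model never receives `occupancy_pct` on its own; it receives the `Diagnosis`, or nothing when the trigger does not fire.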

5:30Eric: My favorite is the stall tool, because it's the clearest picture of what "expert reasoning, codified" actually means. A stall is just a thread sitting idle, waiting on something. The tool reads five different counters for why a thread isn't issuing instructions, finds the dominant reason, and branches the fix to match. Waiting on memory? Cache it in shared memory, use wider vectorized loads. Stuck at a synchronization barrier? Replace the barrier with warp-level primitives. Waiting on a branch to resolve? Replace the if-statement with arithmetic. Same "what kind of stall, and why" reasoning a human runs in their head — written down as a rule.
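The stall tool's "find the dominant reason, then branch" logic can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the category names and the input format are hypothetical, and only the three stall types named in the episode are encoded, not all five counters the tool is said to read.

```python
# Hypothetical stall categories and prescriptions, paraphrasing the episode.
STALL_FIXES = {
    "memory": ["cache reused data in shared memory", "use wider vectorized loads"],
    "barrier": ["replace the barrier with warp-level primitives"],
    "branch": ["replace the if-statement with arithmetic"],
}


def diagnose_stalls(stall_pct: dict[str, float]) -> dict:
    """Pick the dominant stall reason and return the matching fixes."""
    reason = max(stall_pct, key=stall_pct.get)
    return {
        "dominant_stall": reason,
        "share_pct": stall_pct[reason],
        "fixes": STALL_FIXES.get(reason, ["no rule encoded for this stall type"]),
    }


report = diagnose_stalls({"memory": 61.0, "barrier": 22.0, "branch": 9.0})
print(report["dominant_stall"], report["fixes"])
```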

6:16Juniper: And there's a filtering step that makes this work, which I think is underrated. Before any tool fires, the system runs once and classifies the kernel — is it compute-bound or memory-bound? Meaning, is it limited by raw math throughput, or by how fast it can move data? That one question gates everything. A memory-bound kernel never gets bothered with tensor-core advice. A compute-bound one skips the memory-layout checks. So you get complete coverage of the actual bottleneck without diluting the signal with irrelevant noise.

6:52Eric: Which connects straight back to your blood-panel point, Juniper. The reason raw counters made the model worse isn't mystical. You dumped two hundred numbers on it, most irrelevant to this kernel, and it pattern-matched on the wrong ones — chased a tensor-core "problem" on a kernel that legitimately doesn't need tensor cores. Silence was better because silence at least didn't send it down a false trail. The fix is to pre-digest the panel into a diagnosis.

7:23Juniper: And one more design choice in that spirit — the tools fire proactively, not on request. In a normal agent setup the model decides which tools to call, and it might call three and skip ten, missing the one that mattered. KernelPro instead fires every bottleneck-relevant tool, every iteration. It's the pilot's pre-flight checklist. You don't run the checks you feel like running; you run every check that could matter for this kind of flight, precisely to catch the thing you didn't expect.
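The classify-once, then fire-everything-relevant flow might look like the Python sketch below. The classification rule (compare compute and memory throughput as a percentage of peak), the metric keys, and the tool names are all assumptions made for illustration; the episode does not say how KernelPro decides between compute-bound and memory-bound.

```python
def classify(metrics: dict) -> str:
    """Hypothetical rule: whichever throughput is closer to its peak wins."""
    if metrics["compute_throughput_pct"] >= metrics["memory_throughput_pct"]:
        return "compute-bound"
    return "memory-bound"


# Each (hypothetical) tool is tagged with the bottleneck classes it applies to.
TOOLS = [
    ("tensor_core_usage", {"compute-bound"}),
    ("memory_coalescing", {"memory-bound"}),
    ("occupancy", {"compute-bound", "memory-bound"}),
    ("stall_reasons", {"compute-bound", "memory-bound"}),
]


def tools_to_fire(metrics: dict) -> list[str]:
    """Fire every tool relevant to the bottleneck; the model does not choose."""
    kind = classify(metrics)
    return [name for name, kinds in TOOLS if kind in kinds]


fired = tools_to_fire({"compute_throughput_pct": 18.0, "memory_throughput_pct": 87.0})
print(fired)
```

For this memory-bound example, the tensor-core tool is filtered out and every remaining tool runs, which is the checklist behavior described above.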

7:56Eric: Now, one tool can't see everything, and this is where it gets concrete. KernelPro draws on three sources, and the clean way to hold this is "what can each one see that the others can't." The first, ncu (NVIDIA's Nsight Compute), sees inside the kernel: occupancy, stall reasons, throughput. The second, nsys (Nsight Systems), sees the system timeline: launch overhead, gaps where the GPU just sat idle between jobs. And the third is the clever one. It reads SASS, the disassembled machine instructions the compiler actually produced, which is what the hardware was literally told to do.

8:33Juniper: Why do you need the literal instructions if you've already got the high-level metrics?

8:40Eric: Because a metric can tell you that something's wrong but not why. The classic case: ncu reports "tensor core utilization, zero percent." Tensor cores are the specialized units that do matrix multiply far faster than general math units — but only if you invoke exactly the right instructions. Zero percent could mean three totally different things. The kernel is element-wise and legitimately doesn't need them. Or it's using an older, slower instruction generation. Or — and this is the killer — it's silently falling back to scalar code while looking fine from the outside. The metric can't tell those apart. Reading the actual instructions can.

9:25Juniper: It's checking the receipt against the kitchen. The bill says the dish was served. Reading the SASS is walking into the kitchen to watch whether they actually cooked it or quietly plated something else.

9:38Eric: And here's the payoff for that, from their production run later on — the SASS tool flagged thirty-seven candidate kernels that compiled fine, ran fine, produced correct answers, and emitted zero tensor-core instructions. Thirty-seven silent scalar fallbacks. From the outside, indistinguishable from a real, slow-but-working kernel. No utilization metric on earth could have caught those. Only reading the machine code could.
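A check in the spirit of the SASS tool can be sketched as a text scan over disassembly, for example the output of `cuobjdump -sass`. This is not the paper's implementation. It assumes that tensor-core matrix instructions show up with mnemonics such as HMMA, IMMA, DMMA, or HGMMA, and that scalar floating-point math shows up as FFMA, FMUL, or FADD; the sample listing is made up.

```python
import re

# Assumption: these mnemonics identify tensor-core vs scalar FP instructions.
TENSOR_RE = re.compile(r"\b(HMMA|IMMA|DMMA|HGMMA|IGMMA)\b")
SCALAR_RE = re.compile(r"\b(FFMA|FMUL|FADD)\b")


def check_tensor_core_use(sass_text: str) -> str:
    """Count instruction mnemonics and flag a possible scalar fallback."""
    tensor = len(TENSOR_RE.findall(sass_text))
    scalar = len(SCALAR_RE.findall(sass_text))
    if tensor > 0:
        return f"ok: {tensor} tensor-core instructions found"
    if scalar > 0:
        return (f"warning: 0 tensor-core instructions, {scalar} scalar FP ops "
                "(possible scalar fallback)")
    return "info: no floating-point math instructions found"


# Made-up disassembly fragment for a kernel that was meant to use tensor cores.
sample = """
        /*0000*/  LDG.E R4, [R2.64] ;
        /*0010*/  FFMA R0, R4, R5, R0 ;
        /*0020*/  FFMA R0, R6, R7, R0 ;
"""
print(check_tensor_core_use(sample))
```

A scan like this can only say that no tensor-core instructions were emitted. Deciding whether that is a bug or legitimate, as with an element-wise kernel, still needs knowledge of what the kernel is supposed to compute.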

10:09Juniper: So that's the interpretation layer. But there's a second half to KernelPro, and it's the part that lets it keep going when things fail — which they do, constantly. The technical core is next: the search that wraps the whole system. It's the densest stretch, and it pays off in a single concrete fact — a task where the system failed forty-three times in a row before it wrote one line that worked, and didn't give up.

10:37Eric: The search is Monte Carlo Tree Search — MCTS — the algorithm famous from Go engines. The idea in one breath: you're facing a tree of possible move sequences far too big to explore fully, so you grow it selectively. You push effort down branches that have been paying off, but you keep some budget to probe neglected ones, because the richest vein might be down a passage that looked unpromising. Explore versus exploit, balanced.

11:06Juniper: Think of it as caving with a finite supply of rope and headlamp battery. You can't map every tunnel. So you go deeper down the passages that keep opening up, but you spend a little checking the gaps you've ignored.
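The explore-versus-exploit balance is usually implemented with a selection score such as UCT. The Python sketch below uses the textbook formula; the episode does not say which selection rule or exploration constant KernelPro uses, so the value `c = 1.4` and the node fields are assumptions.

```python
import math


def uct_score(total_reward: float, visits: int, parent_visits: int,
              c: float = 1.4) -> float:
    """Textbook UCT: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf          # always try an unvisited child first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: list[dict], parent_visits: int) -> dict:
    """Pick the child with the highest UCT score."""
    return max(children,
               key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits))


children = [
    {"name": "tiling", "reward": 9.0, "visits": 10},     # paying off, well explored
    {"name": "vectorize", "reward": 0.5, "visits": 1},   # barely probed
]
chosen = select_child(children, parent_visits=11)
print(chosen["name"])
```

Here the barely probed branch is selected even though its average reward is lower, because its exploration bonus is still large. As it accumulates visits, the bonus shrinks and the well-performing branch takes over.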

11:20Eric: And the reframe that forces all their custom machinery — in a normal LLM tree search, a node is a cheap reasoning step, a branch of tokens. Here, every single node is a complete kernel. To create one child you generate code, compile it, run it, and profile it. Nodes are expensive, real artifacts. That changes everything about how you're allowed to grow the tree.

11:45Juniper: So how do they grow it without burning the whole budget? You can't afford to expand every option at every node.

11:52Eric: Two adaptations carry most of the intuition. First, asymmetric branching. A node that's a working kernel — something you can optimize — is allowed more children than a node that's broken and needs repair. That encodes a real observation: the space of valid optimizations is wide, but the space of valid repairs is narrow. Found a good room, explore it generously; hit a collapse, give it a couple of digs and move on. In fact a broken node gets three repair attempts, and if all three fail it's marked dead and abandoned.
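Asymmetric branching can be sketched as a per-node child limit that depends on whether the node's kernel works. In the Python sketch below, the three-repair limit comes from the episode; the limit of four children for working kernels and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

MAX_OPTIMIZE_CHILDREN = 4   # hypothetical width for working kernels
MAX_REPAIR_ATTEMPTS = 3     # from the episode: three repairs, then dead


@dataclass
class Node:
    works: bool                                   # did the kernel compile and pass?
    children: list["Node"] = field(default_factory=list)
    dead: bool = False

    def can_expand(self) -> bool:
        if self.dead:
            return False
        limit = MAX_OPTIMIZE_CHILDREN if self.works else MAX_REPAIR_ATTEMPTS
        return len(self.children) < limit

    def add_child(self, child: "Node") -> None:
        self.children.append(child)
        # A broken node whose repair attempts all failed is abandoned.
        if (not self.works
                and len(self.children) >= MAX_REPAIR_ATTEMPTS
                and not any(c.works for c in self.children)):
            self.dead = True


broken = Node(works=False)
for _ in range(3):
    broken.add_child(Node(works=False))
print(broken.dead, broken.can_expand())
```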

12:28Juniper: And the second one is the reward, which is where there's a genuinely elegant move. When a kernel works, its score isn't its raw speedup — it's the logarithm of the speedup. And the reason is the part I'd repeat to a friend. A hundred-times speedup on garbage code that was trivially terrible to begin with would otherwise dominate the entire tree and drag the search toward easy junk. Taking the log compresses those giant easy wins.

12:58Eric: It's grading effort, not raw distance. Getting an elite marathoner two minutes faster is a real achievement. Getting a couch potato to jog a mile is impressive — but it's a different category, and you shouldn't let it outrank the marathoner just because the raw number's bigger.

13:17Juniper: Exactly. A two-times gain on already-tight code is arguably harder than fifty-times on a mess, and log-scaling reflects that. But the reward does a second, sharper job too. A correct kernel with no improvement scores zero. Incorrect output scores minus two. A crash scores minus three. So a correct-but-mediocre kernel always beats an almost-correct broken one — there's a hard wall at correctness. The search can never get seduced into polishing something that doesn't actually work.
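The reward described above can be written down in a few lines of Python. Two details are assumptions rather than facts from the episode: the log base (natural log here) and the clamp that scores a correct but slower-than-baseline kernel at zero, which is added so that every correct kernel stays above the penalty for an incorrect one.

```python
import math


def reward(status: str, speedup: float = 1.0) -> float:
    """Score a candidate kernel.

    status is "crash", "incorrect", or "correct" (hypothetical labels).
    speedup is baseline_time / kernel_time, used only for correct kernels.
    """
    if status == "crash":
        return -3.0
    if status == "incorrect":
        return -2.0
    # Assumption: natural log, clamped at zero to keep the correctness wall.
    return max(0.0, math.log(speedup))


for label, args in [("crash", ("crash",)),
                    ("incorrect", ("incorrect",)),
                    ("correct, 1x", ("correct", 1.0)),
                    ("correct, 2x", ("correct", 2.0)),
                    ("correct, 100x", ("correct", 100.0))]:
    print(f"{label:>14}: {reward(*args):.2f}")
```

With the log applied, a 100x speedup scores less than seven times as much as a 2x speedup instead of fifty times as much, which is the compression of easy wins described above.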

13:50Eric: And there's an append-only memory the system keeps across iterations, so it doesn't repeat dead approaches or lose a file path it already found. I'll flag — that one didn't really earn its keep, and we'll come back to it.

14:05Juniper: So where we are: split interpreting the profiler from writing the code, hand the interpreting to deterministic expert tools, fire only the ones relevant to the actual bottleneck, and wrap the whole thing in a tree search that's patient through failure and refuses to be fooled by easy wins. That patience is concrete — on one task, the NetVLAD kernel, the system failed forty-three straight times before its first working version. Then profiling-guided refinement took it from below baseline to nearly three-times faster in two steps. A fixed single-shot approach would have quit at attempt one.

14:47Eric: So does the whole thing actually hold up? On KernelBench, the standard benchmark of two hundred fifty kernel tasks across three difficulty tiers, from single operations up to full model architectures, the prediction from all this is that comprehensive, interpreted feedback should beat both raw data and silence. And it does. Remember the ablation ladder: raw counters, one-point-eight. Nothing, three-point-three. The full tool pipeline, four-times. Going by those rounded figures, the tools put the pipeline about twenty percent above the no-feedback version and roughly a hundred-and-twenty percent above the raw-counter version.

15:26Juniper: And the headline benchmark numbers — geometric-mean speedups of roughly two-and-a-half, four-and-a-half, and five-plus times across the three levels, with the prior best system sitting around one-and-a-half, two-and-a-half, and one-and-a-half. The hardest tier, full architectures, is where the margin is widest. Every component was ablated independently with proper significance testing — tools, the tree search, the proactive firing — and the tree search alone beats greedy by twenty-six percent on average, and on individual tasks by up to ten-times.
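Because the headline numbers are geometric means, it helps to see how that average behaves next to an arithmetic mean when one task has a huge speedup. The Python sketch below uses made-up speedups, not figures from the paper.

```python
from statistics import geometric_mean, mean

speedups = [1.2, 2.0, 3.0, 100.0]   # made-up values with one easy outlier

arith = mean(speedups)
geo = geometric_mean(speedups)
print(f"arithmetic mean: {arith:.2f}x, geometric mean: {geo:.2f}x")
```

The geometric mean is the n-th root of the product of the speedups, so a single 100x task pulls it up far less than it pulls up the arithmetic mean.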

16:05Eric: Before this sounds like a clean sweep, I want to put the real caveat on the table now, plainly, because a sharp listener is already forming it. Two things. One: every speedup here is measured against PyTorch eager, the unoptimized default. Some of those reference implementations are trivially slow, which is how you get hundred-times speedups in the first place. "Five-times over eager PyTorch" is a genuinely easier claim than "five-times over a competently optimized baseline." To their credit, they report capped tables and the gains survive, but it matters.

16:42Juniper: And the second?

16:43Eric: The cross-system numbers aren't apples-to-apples, and the authors say so themselves in the appendix. Competing systems run on different GPUs with different optimization ceilings, report different kinds of averages, and test on different task subsets — one of them even trains on most of the benchmark and tests on the rest. So the advantage over prior work is real, but the precise multipliers — that "two-hundred-fifty-percent better on Level 3" framing — those are not a fair head-to-head. I'd trust the internal ablations far more than the leaderboard comparisons, because the ablations vary one thing at a time on the same setup.

17:24Juniper: That's fair, and I think the strongest evidence isn't the benchmark anyway — it's the production case study. This is the payoff. There's a kernel inside a real training system called VeOmni — a weight-gradient kernel for a mixture-of-experts model — that had been hand-tuned in Triton by expert engineers. KernelPro was pointed at it cold, told to write a kernel from scratch in raw CUDA.

17:49Eric: And watch the trajectory, because this is the thing to actually look at. The climb.

17:55Juniper: The first attempt comes in at about fourteen-times slower than the expert baseline. A disaster. But each step, the profiler surfaces the specific bottleneck, the tools diagnose it, the code gets rewritten against that exact problem. And the curve climbs — fourteen-times slower, then roughly even, then just under the baseline, then past it. Eighteen iterations in, it lands at one-point-two-three times faster than the kernel the human experts wrote. And it found sixteen other distinct correct kernels that also beat the baseline.

18:31Eric: And the detail that makes this more than a number — the winning kernel contains zero library calls. No cuBLAS, no CUTLASS template. It's hand-composed from the lowest-level primitives. This is the boldest claim in the paper: most automated systems pick from a menu — they instantiate a high-level template, choosing tile sizes from preset options. KernelPro composed raw source from CuTe building blocks, the way an expert does when no template fits.

19:01Juniper: And there's one decision in that run that I keep coming back to. At one point the system reasoned that a particular fast memory-transfer instruction would mis-read a neighboring expert's data under the model's irregular routing — and it deliberately chose an older, slower-looking technique instead, because the fast one would have been wrong. That's an architectural trade-off normally reserved for human kernel engineers. It didn't just stumble into working code; it reasoned about why the obvious fast path was a trap.

19:37Eric: It's a strong existence proof. I want to be precise about what it proves, though — because this is where I land as the skeptic. It's one kernel, on one GPU, at one-point-two-three times. That's a modest margin, the kind where run-to-run variance and how hard the human tried both genuinely matter. It's an N-of-one production result. It absolutely demonstrates the system can match and edge past an expert on a real problem — that's not nothing. But "beat the humans" as a general claim rests on a single data point, and I'd want a dozen before I retired anyone.

20:16Juniper: That's a fair line, and I won't push past it — one kernel is one kernel. Though I'd say the existence proof is the hard part. Getting from "can't beat experts ever" to "beat them once, by reasoning like one" is the qualitative jump; closing from one to a dozen feels more like turning a crank.

20:36Eric: Maybe. Crank-turning is where a lot of methods quietly die, though.

20:40Juniper: Granted. There's a coda worth a minute, because it's a different idea entirely — energy, not speed. The authors built an energy-aware version, and the question is whether you can make a kernel cheaper to run without making it faster.

20:57Eric: And on a kernel called Swish, two versions came out at identical speed and identical memory traffic — but one drew about twelve percent less energy. Same arrival time, less power. The mechanism is the vivid part: the efficient version compiled to fifty-six instructions instead of two hundred sixteen. It chose a fast approximate reciprocal over the IEEE-accurate division.

21:22Juniper: And here's why that's free on time but not on power. The kernel is bandwidth-bound — it's waiting on memory. So those extra hundred-sixty instructions execute inside time the chip was going to spend stalled anyway. No wall-clock cost. But the silicon still burns power running them. It's two cars arriving at the same red light at the same moment — one with its engine needlessly revving the whole way there.

21:50Eric: And to be honest the way they want it, this is preliminary. One data point, one simple kernel. And the kicker — the win came from the reward's instruction-count term, not from any of the four dedicated energy tools they built for exactly this. Those didn't even fire on the winning case. So "first energy-aware kernel agent" is a real direction resting on a single demonstration. They say as much.

22:16Juniper: Same honesty showed up in the ablations, actually — that search memory you flagged earlier? Listed as a headline feature, it added six percent and didn't clear significance. They reframe it as helping early convergence rather than final quality, which is the honest read, but it's weaker than the billing.

22:35Eric: Which I actually count in their favor. A paper that tells you which of its own ideas underperformed is a paper I trust more on the ideas that did.

22:45Juniper: So the big idea to walk away with. It's not the four-times speedup, and it's not even the one kernel that beat the experts. It's the reframe underneath. When you connect a language model to a rich technical domain, the instinct is to give it maximal raw access and trust it to sort everything out. This paper is a measured, statistically defended argument that the instinct is backwards — that raw data without interpretation is noise that actively misleads, and the productive move is to build a deterministic layer that pre-digests the measurements into the kind of diagnosis a human expert would form. Diagnose first, then prescribe. And the authors point out it generalizes — database tuning, compiler optimization, anything with measurable metrics, known failure signatures, and known fixes.

23:36Eric: Which leaves a real question to chew on. KernelPro's whole edge comes from hand-encoding fifteen expert heuristics as rigid rules — a checklist, not judgment. So where do you land: is the future of AI in specialized domains exactly this, models wrapped in carefully built interpretation layers that hand them expert-shaped diagnoses — or is hard-coding human heuristics a crutch we'll look back on, and a big enough model should eventually learn to read the raw counters itself? If you've ever hand-tuned a kernel, you probably already lean one way — say which, and why.

24:20Juniper: The full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers grouped by theme, plus our weekly and monthly roundups.

24:35Eric: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Juniper and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Optimizing CUDA like a Human" — KernelPro, out of Amazon, posted June twenty-fourth, twenty twenty-six, and we recorded this two days later.

25:01Juniper: The trick was never giving the model a bigger pile of numbers. It was teaching it where to look. We'll see you in the next one.