Episode 172 · Jun 25, 2026 · 22 min

One Bad Token Can Sink a Model's Math, And You Can Delete It

Ko, Kang, Lee

LLM Reasoning · Interpretability

Concepts in this episode

AI Alignment · Training Methods · Evaluation & Benchmarks · Token-Level Analysis · Math Reasoning · Trajectory Analysis · Causal Intervention · DPO · Rollout Sampling · Pass@k Metric · Credit Assignment · Emergent Behavior · Self-Correction

About this episode

Paper: Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Venue: arXiv:2606.25524

Year: 2026

Read the paper: arxiv.org/abs/2606.25524

When a language model botches a math problem, it's often not because it ran out of knowledge — it's because a single word at a single instant tipped a winning solution into a doomed one. This paper proves that token actually causes the failure, shows you can delete it and rescue the whole solution, and uncovers three distinct kinds of failure, two fixable and one baked in for good.

What you'll take away

What a 'cliff token' is — the single word-piece where a recoverable solution tips into a doomed one, with a high plateau then a sheer drop in 'potential'
How a delete-vs-keep resampling experiment proves the token causes the failure rather than just sitting near it
Why cliffs come in three flavors — confident-wrong (deterministic), uncertain (knowledge gap), and sampled-off (bad luck) — and which ones training can fix
The unnerving finding that an 8B and a 600M model walk off the same confident-wrong cliff at the same word, pointing to a baked-in bias scaling doesn't touch
How Cliff-DPO trains on ~33,000 token positions instead of ~5.8 million, cutting training from 112 minutes to 8 while matching or beating the baseline
Where the headline breaks down: there's no cliff to find when a problem is simply beyond the model, and the cheap training hides 4,000 GPU-hours of detection cost

Chapters

01:59 Why does deleting one word save it?
03:03 How do you measure a doomed trace?
04:21 The stray 7 that doomed everything
05:30 Crime or chalk outline?
07:53 Telling a real cliff from a glitch
11:28 Three students, three wrong answers
15:07 Fixing one stitch, not the whole shirt
17:40 Where the headline stops holding
20:17 A microscope, not yet a safeguard

References in this episode

Direct Preference Optimization: Your Language Model is Secretly a Reward Model — The DPO method the episode's Cliff-DPO directly builds on and narrows to a singl
Let's Verify Step by Step — The canonical process-supervision paper that motivates measuring correctness at
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Establishes the multi-step reasoning traces whose token-by-token fragility this
Self-Consistency Improves Chain of Thought Reasoning in Language Models — The sampling-many-paths-and-aggregating idea that frames the episode's pass@64 v

Full transcript

0:00Cassidy: A grandmaster is forty moves into a winning game. Every piece is exactly where it should be, the position is crushing — and then one move, a single careless pawn push, and it's gone. Not because the rest of the play got sloppy. Because of that one move. Play perfectly for the next twenty turns and you still lose. This paper argues that large language models fail at math the exact same way. There's often one token — one word-piece — where a correct solution tips into a doomed one. And here's the part that stopped me cold: you can delete that single token, hit regenerate, and rescue the whole solution. Sample sixty-four fresh attempts from just before it, and at least one of them now lands on the right answer — for every single problem.

0:48Eric: Quick heads up before we dig in — this is an AI-made explainer, both voices included. And Cassidy, every problem becoming solvable again is the thing that does it for me. That's not "it helps." That's "this one token was the entire problem."

1:04Cassidy: Right. So by the end of this you'll understand three things: what a "cliff token" actually is, how they proved it causes the failure instead of just sitting near it, and the genuinely surprising discovery that these cliffs come in three distinct flavors — two of which you can train away, and one you basically can't.

1:25Eric: And the reason this matters reaches past math. The standard mental model is that a model either knows how to solve a problem or it doesn't, and a wrong answer means it hit the edge of what it learned. This paper plants a flag in the opposite camp for a huge class of failures — the model does know, and one bad word at the wrong instant throws the whole thing. That's an optimistic message, because localized problems have localized fixes.

1:53Cassidy: Which is the puzzle, isn't it? Same model, same question, same knowledge — and you get a mix of right and wrong runs. If the capability is there, why does deleting one word save it?

2:06Eric: That's the whole engine of the paper. When a model generates, it isn't picking one answer deterministically. At every step it produces a probability distribution over the next token, and a sampler draws from it — so the same prompt run twice gives you different paths. Think of a skilled golfer who sinks the putt most of the time but not every time. The skill is there; the execution varies. So when you ask "why did this attempt fail," the honest answer used to be a shrug — the model drifted, lost the thread, accumulated errors. Vague.

2:43Cassidy: And the authors push on a much sharper hypothesis. Maybe failure isn't diffuse at all. Maybe before some specific token, most ways of finishing the problem would have worked — and after it, almost none do. The decline isn't gradual. It's a cliff.

3:00Eric: To even test that, they needed a way to measure "how's this going" at every point in the reasoning. And the measure they use is beautifully blunt. They call it token-wise potential. Picture the reasoning trace as a choose-your-own-adventure book. At any page, potential asks: from this exact page, if I read forward sixty-four different ways, how many of those endings are happy? How many reach the correct answer?

3:28Cassidy: So a potential of eighty percent means the trace is in great shape — most continuations win. And potential collapsing toward zero means the model has painted itself into a corner.

3:40Eric: Exactly. And there's no clever formula here — it's brute force. To get the potential at a token, you literally fork the generation sixty-four times from that point and count the successes. The actual contribution isn't the idea of potential, which they borrow from earlier work. It's the resolution. Prior work measured this at maybe twenty checkpoints along a trace, or sentence by sentence. These authors crank it up to every single token. That's what lets you point at the one word where things went wrong instead of the rough neighborhood.
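
Editor's sketch: the brute-force estimate Eric describes (fork the generation sixty-four times from a point and count the successes) might look roughly like the code below. This is a minimal illustration, not the authors' code. `toy_rollout`, the token names, and the 80% and 2% success rates are all hypothetical stand-ins for "finish the solution with the model and grade the answer".

```python
import random
from typing import Callable, List

Rollout = Callable[[List[str], random.Random], bool]

def token_potential(prefix: List[str], rollout: Rollout, k: int = 64, seed: int = 0) -> float:
    """Fraction of k sampled continuations of `prefix` that reach a correct answer."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(k) if rollout(prefix, rng))
    return wins / k

def potential_curve(trace: List[str], rollout: Rollout, k: int = 64) -> List[float]:
    """Potential measured after every token, so entry t covers trace[:t + 1]."""
    return [token_potential(trace[: t + 1], rollout, k, seed=t) for t in range(len(trace))]

def toy_rollout(prefix: List[str], rng: random.Random) -> bool:
    # Hypothetical stand-in for a real model plus answer checker:
    # continuations succeed 80% of the time until the prefix contains "BAD".
    p_success = 0.02 if "BAD" in prefix else 0.80
    return rng.random() < p_success

trace = ["step1", "step2", "step3", "BAD", "step4", "step5"]
curve = potential_curve(trace, toy_rollout)
print([round(p, 2) for p in curve])
```

In this toy setup the printed curve stays high for the first three tokens and collapses from the fourth onward, which is the plateau-then-drop shape the episode describes.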

4:17Cassidy: And the paper has this perfect concrete example for it. The model is factoring the number 1,092. Walk through what happens.

4:26Eric: It's lovely. The model is working through the factorization, and at one specific token it writes a "7" — it inserts an extra factor of seven where the correct path needed a three. Now watch the potential curve on screen. Before that "7," the line is riding high — plenty of ways to still get this right. The instant the "7" lands, the curve falls off a ledge and crashes toward zero. Nearly every continuation from there is now wrong. That stray "7" is the cliff token. One digit, and a recoverable problem becomes a doomed one.

5:03Cassidy: And that curve — the high plateau, then the sheer drop — that's the image to hold onto, because the whole paper lives on it. Now, a skeptic hears "the potential dropped after that token" and immediately says: correlation. Maybe that token just happens to sit where things were already going south. How do they show the token actually causes the failure?

5:27Eric: This is the cleanest experiment in the paper, and it's surgical. They take a failed trace, find the first cliff token, and run two versions. In one, they delete that token and let the model resample from just before it — call it the delete condition. In the other, they keep the cliff token locked in and resample everything after it.

5:50Cassidy: And the results split hard. Delete the cliff token and resample, and pass@64 — remember, that's "out of sixty-four tries, does at least one land" — recovers all the way to one-point-zero. Every problem becomes solvable again. But keep that token in place, and even with sixty-four fresh continuations you're capped somewhere between seventy and a hundred percent. Sometimes sixty-four tries can't escape what one locked-in word already decided.

6:20Eric: And that asymmetry is the proof. If the cliff token were just a bystander, keeping it wouldn't matter — you'd recover either way. The fact that you're trapped while it's present, and free the moment it's gone, says the token is doing the damage. It's the blunder, not the symptom.
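
Editor's sketch: the delete-versus-keep comparison can be illustrated as below. Only the two prefixes differ: one stops just before the cliff token, the other includes it. This assumes a hypothetical `toy_rollout` in which a locked-in bad token makes success impossible; real traces are less clear-cut, as the seventy-to-a-hundred-percent range in the keep condition shows.

```python
import random
from typing import Callable, Dict, List

Rollout = Callable[[List[str], random.Random], bool]

def pass_at_k(prefix: List[str], rollout: Rollout, k: int = 64, seed: int = 0) -> bool:
    """True if at least one of k sampled continuations of `prefix` is correct."""
    rng = random.Random(seed)
    return any(rollout(prefix, rng) for _ in range(k))

def delete_vs_keep(trace: List[str], cliff_index: int, rollout: Rollout, k: int = 64) -> Dict[str, bool]:
    """Resample from just before the cliff token versus from just after it."""
    delete_prefix = trace[:cliff_index]       # cliff token removed
    keep_prefix = trace[: cliff_index + 1]    # cliff token locked in
    return {
        "delete": pass_at_k(delete_prefix, rollout, k),
        "keep": pass_at_k(keep_prefix, rollout, k),
    }

def toy_rollout(prefix: List[str], rng: random.Random) -> bool:
    # Hypothetical grader: once "BAD" is in the prefix, no continuation succeeds.
    p_success = 0.0 if "BAD" in prefix else 0.5
    return rng.random() < p_success

trace = ["step1", "step2", "BAD", "step3"]
result = delete_vs_keep(trace, cliff_index=2, rollout=toy_rollout)
print(result)
```

If the token were only a bystander, both entries would come out the same. The gap between them is the causal signal Eric is pointing at.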

6:39Cassidy: There's a nice contrast here with earlier work, too — the "critical tokens" idea from Lin and colleagues. Those flag the point where the chance of success has already hit zero and stays there.

6:51Eric: Right, and that's the difference between the crime and the chalk outline. Critical tokens mark where the trace is already dead — the aftermath. Cliff tokens are the earlier trigger that caused the collapse in the first place. And when they compare head to head, deleting the cliff token recovers the trace better than deleting the critical token in seventeen of nineteen comparable settings. They're catching the cause, not the corpse.

7:19Cassidy: Now I want to flag something before we go further, because a sharp viewer is already forming this objection. What they're showing is recovery on failures that had a cliff to find in the first place. That word "recoverable" is doing quiet work — and the boundary it draws matters a lot later.

7:38Eric: It does, and we'll give it its full due. But first there's a real problem hiding inside that potential measurement, and how they solve it is the methodological heart of the whole thing.

7:50Cassidy: The problem being that sixty-four samples is... not a lot?

7:54Eric: Not a lot at all. Reasoning traces are wildly volatile — the authors cite a finding that swapping even a single early sentence flips the final outcome more than forty-seven percent of the time. So your sixty-four-fork estimate of potential is jittery. A drop of, say, fifteen percent might be a genuine cliff, or it might just be noise in your own measurement. If you use a fixed cutoff — "flag any thirty percent drop" — you'll hallucinate cliffs all over the noisy regions.

8:24Cassidy: So the technical core is next, and it's worth the climb, because it pays off in the one thing that lets you trust the entire cliff catalog — a way to tell a real cliff from a measurement glitch. Eric, make the abstraction concrete.

8:40Eric: Think about a stock dropping five percent. In a famously stable utility stock, that's alarming — something real happened. In a notoriously jumpy meme stock, five percent is just a Tuesday. To call a move "real," you scale your bar to how noisy that thing normally is. That's the entire idea. Instead of a fixed threshold, they require the potential drop to clear a bar that grows with their statistical uncertainty. On screen, watch the threshold bend: where their estimate is shakiest — right around fifty-fifty, where a proportion is hardest to pin down — the bar rises automatically, from about eighteen percent up toward twenty-four. Where the data is clean, near the extremes, it relaxes.
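
Editor's sketch: the episode does not spell out the paper's exact threshold rule, so the code below is only one plausible reading of it. It takes a base bar and adds one binomial standard error of the potential estimate. The names `adaptive_threshold` and `is_cliff`, and the values `base=0.18` and `z=1.0`, are assumptions, chosen so the toy bar spans roughly the 18 to 24 percent range quoted above with sixty-four rollouts.

```python
import math

def adaptive_threshold(p_before: float, n: int = 64, base: float = 0.18, z: float = 1.0) -> float:
    """Bar a potential drop must clear; widest where the estimate is noisiest."""
    # Standard error of a proportion estimated from n rollouts, largest at p = 0.5.
    standard_error = math.sqrt(p_before * (1.0 - p_before) / n)
    return base + z * standard_error

def is_cliff(p_before: float, p_after: float, n: int = 64) -> bool:
    """Flag a token only if the drop in potential exceeds the adaptive bar."""
    return (p_before - p_after) > adaptive_threshold(p_before, n)

for p in (0.5, 0.8, 1.0):
    print(p, round(adaptive_threshold(p), 4))
```

Under these assumptions the same twenty-point drop counts as a cliff when it starts from a clean estimate near 1.0, but not when it starts near fifty-fifty, where the bar is higher.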

9:27Cassidy: So it's strict where the measurement is noisy and lenient where it's clean. The drop has to be big and trustworthy.

9:35Eric: That's the one sentence. It's the difference between "the number dropped a lot" and "the number dropped a lot and I believe it." That's what separates their work from the fixed-threshold prior art — and it's a forced move, because they genuinely cannot afford to just run more rollouts to beat down the noise.

9:56Cassidy: Which brings us to the cost, and this number is almost comic. Estimating potential at every token across just 1,610 traces cost about four thousand A100 GPU-hours. Why is a fairly modest analysis that expensive?

10:11Eric: Because the work scales as the square of the trace length. Here's the intuition: to get potential at a token, you generate the rest of the trace sixty-four times. Do that at every token. And early tokens need long rollouts — the whole rest of the solution — while late tokens need short ones. It's like checking a road trip not just at the exits but at every single mile, and at each mile simulating the entire remaining drive dozens of times. The earlier you are, the longer each simulation runs. Sum it up and it balloons quadratically. That cost is exactly why they had to be statistically clever instead of just buying more samples — and, hold this thought, it's why this is a research lens today and not something running live in your chatbot.
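
Editor's sketch: a back-of-the-envelope count shows where the quadratic growth comes from. It makes one simplifying assumption, namely that every continuation from position t is as long as the rest of the original trace. The function names are illustrative only.

```python
def rollout_tokens(trace_len: int, k: int = 64) -> int:
    """Tokens generated if each position t gets k rollouts of length trace_len - t."""
    return sum(k * (trace_len - t) for t in range(1, trace_len + 1))

def closed_form(trace_len: int, k: int = 64) -> int:
    """Same count via the arithmetic series: k * T * (T - 1) / 2."""
    return k * trace_len * (trace_len - 1) // 2

short, long = rollout_tokens(100), rollout_tokens(200)
print(short, long, round(long / short, 2))
```

Doubling the trace length roughly quadruples the generated tokens, which is why adding more rollouts per token is not a cheap way to reduce the noise.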

11:02Cassidy: So let me consolidate where we are. Failure often hinges on one token; deleting it rescues the trace, which proves cause; and the adaptive threshold is what lets us trust which tokens count as cliffs in a noisy estimate. That's the diagnostic. Now the genuinely surprising part — because it turns out not all cliffs are the same animal.

11:24Eric: This is my favorite turn in the paper. They sort cliffs using two signals. One is the model's entropy at that token — and entropy is just how spread out the model's bets were. Low entropy, it's slamming almost all its weight on one option, confident. High entropy, the weight's smeared across many, it's unsure. The second signal is whether the cliff token was the model's top pick — its greedy choice — or a lower-ranked option the sampler happened to grab.

11:56Cassidy: And those two knobs give three characters. Don't think of them as definitions — think of them as three students getting a quiz question wrong. The first is the confidently-wrong student. Low entropy, picked their top choice, and that choice was the trap. They're sure, and they're sure in the same wrong way every time. That's a deterministic cliff — confident bias.

12:21Eric: The second hesitated. High entropy, still went with their top choice, but they were genuinely torn and leaned the wrong way. That's the knowledge gap — an uncertain cliff. The model didn't really know.

12:34Cassidy: And the third is the most interesting. This student actually knew the right answer — their top choice was fine — but their pen slipped. The sampler rolled the dice and grabbed a worse option than the model itself preferred. Pure bad luck. That's a sampled-off cliff.

12:51Eric: And on screen you can see these three are genuinely distinct, not a story we're imposing. The probability-mass plots show deterministic cliffs piling almost all their weight on the bad token — the model committed. Uncertain cliffs spread the weight but still lean bad. And sampled-off cliffs put most of their probability on the good tokens — the bad one just got drawn by chance. Three different fingerprints.
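
Editor's sketch: the two signals described above, entropy at the cliff position and whether the sampled token was the greedy pick, are enough to write a toy classifier. The entropy cutoff of 1.0 nat is a hypothetical value, since the episode gives no threshold. Treating every non-greedy cliff as sampled-off is likewise a simplifying assumption.

```python
import math
from typing import Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def classify_cliff(probs: Sequence[float], sampled_index: int, entropy_cutoff: float = 1.0) -> str:
    """Sort a cliff token into one of the three types named in the episode."""
    greedy_index = max(range(len(probs)), key=lambda i: probs[i])
    if sampled_index != greedy_index:
        return "sampled-off"        # the sampler drew a lower-ranked token
    if entropy(probs) < entropy_cutoff:
        return "deterministic"      # confident top choice that was wrong
    return "uncertain"              # top choice, but the model was torn

print(classify_cliff([0.97, 0.02, 0.01], sampled_index=0))
print(classify_cliff([0.30, 0.25, 0.25, 0.20], sampled_index=0))
print(classify_cliff([0.90, 0.10], sampled_index=1))
```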

13:18Cassidy: There's a number that backs the whole taxonomy, too. Ordinary reasoning tokens are the model's top choice ninety-five to ninety-eight percent of the time. Cliff tokens? Only thirty-nine to eighty-two percent. So cliffs disproportionately happen when the model is off its game — but not always. Plenty of cliffs are still confident top choices. That spread is exactly why one explanation won't cut it.

13:44Eric: And now the result that made me sit up. They asked: if a big model and a tiny model from the same family hit a deterministic cliff — the confident-wrong kind — do they fall into the same trap?

13:57Cassidy: And the answer is unnerving. At forty-four of forty-six deterministic cliff positions found by the eight-billion-parameter model, the tiny six-hundred-million one picks the identical bad token. And the reverse holds too — at every one of the small model's deterministic cliffs, the big one mirrors the choice.

14:17Eric: Same trap. Wildly different sizes, and they walk off the exact same ledge at the exact same word. That's not a capability gap — a bigger model is supposed to be more capable. It points to a shared, baked-in bias, something inherited from pretraining that scaling up just doesn't touch. The confident-wrong student doesn't get cured by being a smarter student. They were taught the misconception.

14:45Cassidy: And that sets up the cleanest validation in the paper — the part where the taxonomy stops being a nice story and starts predicting things. Because if these three types are real, they should behave differently when you try to train them away. Eric, the one-line on the training method first.

15:04Eric: Sure. The standard tool here is DPO — Direct Preference Optimization. The plain version: you show the model matched pairs, a good output and a bad one, and nudge it to favor the good and disfavor the bad. No separate reward model, just "prefer this over that." Normally it operates on whole responses — re-sewing the entire shirt because one seam was off.

15:28Cassidy: And Cliff-DPO finds the one bad stitch and reworks only that. At a cliff position, the cliff token is the "rejected" option and a non-cliff alternative is the "chosen" one — and they train on that single token position. Nothing else.
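
Editor's sketch: assuming Cliff-DPO applies the standard DPO objective to the log-probabilities of just two tokens at the cliff position (the episode does not give the exact loss), a single-position version could look like this. The function name, `beta=0.1`, and the example probabilities are illustrative assumptions.

```python
import math

def single_token_dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss, -log(sigmoid(margin)), for one token position."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference model: the loss sits at log(2).
baseline = single_token_dpo_loss(math.log(0.3), math.log(0.5), math.log(0.3), math.log(0.5))
# Policy has shifted weight from the cliff token to the alternative: the loss falls.
improved = single_token_dpo_loss(math.log(0.6), math.log(0.2), math.log(0.3), math.log(0.5))
print(round(baseline, 4), round(improved, 4))
```

Here the rejected token is the cliff token and the chosen token is a non-cliff alternative at the same position, so only that one position contributes to the gradient.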

15:44Eric: And the efficiency that buys is the cheeky payoff. The best variant matches or beats the standard baseline while updating about thirty-three thousand token positions instead of nearly five-point-eight million. That's well over a hundred times fewer — about a hundred and seventy-seven. Training drops from a hundred and twelve minutes to eight.

16:07Cassidy: With real accuracy gains — on one math benchmark, jumping from fifty-seven up to sixty-three-and-a-half. So concentrating effort on the handful of tokens that actually mattered isn't just cheaper, it works.

16:21Eric: But the part that validates the taxonomy is the asymmetry — and you predicted it from the characters. Which type carries a useful training signal?

16:31Cassidy: This is the satisfying one. Training on the uncertain cliffs and the bad-luck cliffs helps — and generalizes out of domain. Training on the deterministic, confident-wrong cliffs gives essentially nothing out of domain — and worse, throwing them into the mix actively hurts.

16:49Eric: Which is exactly what the analogy predicted. You can coach away the shaky guess, and you can fix the slipped pen. But the confident, baked-in misconception — the one that's identical across scales — you can't just nudge that away with a few preference pairs. The taxonomy isn't decoration. It tells you which failures are fixable and which aren't.

17:13Cassidy: That's the irony I keep coming back to. The failures that look most like "the model is broken" — confident and wrong — are the ones training can't help. And the one that looks like dumb luck, the sampler slipping, is the one that responds best. The fixable problem was the one we'd have written off as random noise.

17:34Eric: Okay. So that's the case at its strongest. Now I want to take that headline — "delete one token and the trace recovers to a hundred percent" — and say carefully where it doesn't hold. Because the framing is more energetic than the result.

17:50Cassidy: This is the boundary you flagged earlier.

17:53Eric: It is. That perfect recovery only happens in cases where a cliff token existed to begin with — and that's a narrower world than it sounds. They ran sixty incorrect traces from one model on a hard competition benchmark, AIME, and found no cliff tokens at all. Fifty-three of them started with zero potential — the model never had a shot from the first token. When a problem is simply beyond the model, there's no cliff to find, because there was nothing to fall off of. So the real claim isn't "we can rescue failed reasoning." It's "we can rescue the failures that were recoverable in the first place" — which is exactly the set where recovery was always possible. The method describes the failures it can describe.

18:40Cassidy: That's fair, and it's a meaningful narrowing of the abstract's energy.

18:45Eric: And there's more underneath it. The whole cliff catalog is detected inside their own noisy, sixty-four-sample estimate of potential. The adaptive threshold is a principled response to that noise — genuinely good statistics — but it's still drawing the line between "real cliff" and "artifact in a volatile region" statistically, not against any ground truth. The catalog inherits that uncertainty. On the training side, the eight-minutes-versus-a-hundred-and-twelve comparison is honest about the training step and quietly silent about the four thousand GPU-hours of cliff detection you have to run first. The stitching is cheap; the inspection that finds the bad stitch is the expensive part. And every bit of this is math reasoning, mostly smaller models, with the training done on a single tiny one. Whether cliffs are a general phenomenon or a math-and-small-model phenomenon is genuinely open.

19:43Cassidy: I'll grant all of that. What I think survives it is the conceptual core — that a real, identifiable, causal single-token trigger exists at all, and that failures sort into kinds with different cures. That reframing holds even if the numbers are scoped tighter than the headline.

20:01Eric: Agreed on the concept. I just want the viewer to carry the scope with the concept, not instead of it.

20:08Cassidy: Which is the right note to end on, because the bigger idea here outlasts the method. The durable takeaway isn't "Cliff-DPO" or any single technique. It's a shift in how to think about reasoning failure: a large chunk of it isn't the model hitting the edge of what it knows — it's a localized misstep, sometimes nothing more than the sampler betting against the model's own better judgment. And localized problems invite localized fixes.

20:36Eric: The tantalizing version, which the authors are upfront is just future work — what if you could spot a cliff as it's forming, mid-generation, cheaply, without sixty-four rollouts? Then you could backtrack or branch before the trace dies, instead of diagnosing it afterward. But that depends entirely on a fast cliff predictor that doesn't exist yet. Today this is a research microscope, not a live safeguard.

21:05Cassidy: So here's the question for you. Where should the effort go — building a detector that catches the cliff the instant it forms, so the model can back up before it blunders? Or accepting that some of these failures are just unlucky dice rolls, and fixing the sampling itself so the model stops betting against what it already knows? Pick a side and tell us why in the comments.

21:31Eric: If you want to go deeper, the full annotated version of this episode is on paperdive.ai — every technical term tap-to-define, with links to the related papers grouped by theme, including the "critical tokens" work we compared against.

21:48Cassidy: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Eric and I are both AI voices from Eleven Labs, and the producer isn't affiliated with Anthropic or Eleven Labs. The paper is "Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning," posted June 24th, 2026 — we recorded this the next day, on the 25th.

22:13Eric: Same model, same question — and one word decides it. The trick, it turns out, is knowing exactly which word.