Episode 179 · Jun 27, 2026 · 23 min

How DeepSeek Made One User Faster Without Slowing Down the Crowd

Xin Cheng, Xingkai Yu, Chenze Shao et al.

Speculative Decoding

paperdive.ai

Concepts in this episode

AI Efficiency & Cost · Systems for ML · Speculative Decoding · Parallel Sampling · Inference Cost · Rollout Sampling · Reward Variance · LLM Serving · KV Cache · Token-Level Analysis · Admission Control

About this episode

Paper

DSpark: Confidence-Scheduled Speculative Decoding with

Venue

raw.githubusercontent.com

Year

2026

Read the paper

raw.githubusercontent.com/deepseek-ai/DeepSpec/main/DSpark_paper.pdf

DeepSeek tore out the fast-text part of its flagship model two weeks into running it — and the replacement makes each user's words come back up to 85% faster while serving the same crowd on the same GPUs. The twist: their winning drafter is the 'dumber' one that guesses words blind, and the whole system works partly because a sloppy production shortcut accidentally made the math more correct. By the end you'll understand the two moves that break a trade-off everyone assumed was iron.

What you'll take away

Why position-one accuracy carries enormous leverage in speculative decoding — and how a 'tall cliff' parallel drafter beats a flat-but-coherent autoregressive one
How DSpark's semi-autoregressive design keeps a deep parallel backbone but adds a tiny cheap correction head to stop the draft's tail from rotting
Why aggressive drafting blew up DeepSeek's last production system, and how making draft length a live, load-aware decision fixes the throughput-versus-latency trade
The causality trap in load-aware scheduling — and how using stale, two-step-old data accidentally restores the lossless guarantee instead of breaking it
The honest critique: the offline quality numbers and the production numbers never meet in one experiment, and the win is partly over a deliberately timid single-token baseline
Why the headline isn't one magic multiplier but a better Pareto frontier — more speed and more users on the same hardware

Chapters

01:30 Why the slow part is the bottleneck
02:07 The junior, the expert, and free speed
03:00 Two camps, both half-right
04:45 When the incoherent drafter won
06:36 A sliver of memory beats stacked depth
10:07 The second bottleneck that killed production
11:51 Express lanes that open with traffic
13:43 How stale data fixed the cheating problem
16:46 Does the whole machine actually hold up?
19:17 The catch the paper sometimes blurs

References in this episode

Fast Inference from Transformers via Speculative Decoding — The original speculative-decoding paper and the rejection-sampling foundation DS
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — The autoregressive drafter lineage (Eagle3) that DSpark benchmarks its accepted-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — The canonical parallel multi-head drafter that exemplifies the 'fire out the who
On Calibration of Modern Neural Networks — Introduces temperature scaling, the exact single-dial calibration fix DSpark app

Full transcript

0:00Cassidy: Two weeks. That's how long DeepSeek ran its newest flagship model before they tore out the part that makes it generate text fast — and bolted in something new. The replacement makes each user's words come back sixty to eighty-five percent faster, on the same GPUs, serving the same number of people at once.

0:21Tyler: Quick heads up before we get into it — this is an AI-made explainer, both voices included. And that pairing is the part that should bug you, Cassidy, because faster-per-user and same-number-of-users usually trade against each other. You speed one person up by handing them more of the machine — which means fewer people fit.

0:42Cassidy: Right — and the system that breaks that trade is called DSpark. By the end you'll understand the two moves it makes, and the strange one is this: their faster drafter is, in one sense, the dumber one. It guesses a whole run of words blind, none of them looking at each other — and it beats the careful model that reads each word before guessing the next.

1:06Tyler: Which sounds backwards. And it matters past DeepSeek's datacenter, because the thing slowing down every chatbot and every agent isn't how much math the chip can do — it's that the model is forced to write one word, wait, write the next. Whether you can serve fast responses to a crowd comes down to beating that.

1:27Cassidy: So start with why it's slow at all. A language model writes one token at a time — a token's roughly a word-ish chunk. To produce each one, it re-reads everything written so far and does a full pass through itself. A five-hundred-token answer is five hundred of those passes, each waiting on the last.

1:47Tyler: And the cruel part is the hardware. A GPU is built to do thousands of things at once, but that one-at-a-time chain means it does one small thing, sits idle, does the next. You're paying for a parallel monster and using it like a typewriter.

2:03Cassidy: The fix everybody already uses is speculative decoding. Picture a fast junior assistant and a senior expert. The junior drafts the next several words in a blink. The expert — the big model — reads that whole draft in a single glance and checks it, which is exactly the cheap, bulk work a GPU's good at. Whatever the expert would've written anyway, you keep. Guess right, you got several words for the price of one expert pass.

2:33Tyler: And there's a piece of math underneath — rejection sampling — that makes it exactly lossless. The accept-or-reject rule on each token is rigged so the final text carries the identical statistical fingerprint the big model would've produced on its own. Free speed, no quality cost. That's the foundation, and DSpark doesn't touch it.
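
The accept-or-reject rule Tyler describes can be sketched as follows. This is a minimal illustration of the standard speculative-sampling rule from the original speculative-decoding paper, not DSpark's code; the function names and the three-token toy distributions are assumptions made up for the example. A draft token is kept with probability min(1, p/q), and on rejection a replacement is drawn from the leftover mass max(0, p - q).

```python
import random


def speculative_accept(p, q, draft_token, rng=random):
    """Decide the emitted token for one position.

    p: target-model probabilities, q: drafter probabilities (token -> prob).
    draft_token is assumed to have been sampled from q, so q[draft_token] > 0.
    Returns (token, was_draft_accepted).
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights)[0], False


def exact_output_distribution(p, q):
    """Work out analytically which distribution the rule above emits."""
    accept_mass = {t: q[t] * min(1.0, p[t] / q[t]) for t in q if q[t] > 0}
    reject_prob = 1.0 - sum(accept_mass.values())
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    out = {}
    for t in p:
        resampled = reject_prob * residual[t] / total if total > 0 else 0.0
        out[t] = accept_mass.get(t, 0.0) + resampled
    return out


# Toy distributions (hypothetical): the drafter over-weights "of".
p = {"of": 0.5, "no": 0.3, "sure": 0.2}
q = {"of": 0.7, "no": 0.2, "sure": 0.1}
emitted = exact_output_distribution(p, q)
```

In this toy case `emitted` comes out equal to `p` up to floating-point error, which is the "identical statistical fingerprint" property: the drafter's bias is cancelled exactly, however wrong `q` is.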

2:56Cassidy: It builds on it. And the open question the field's been chewing on is how to get that junior drafter to propose longer, better runs of words — because that's where the speedup lives. Here's where it split into two camps that each got it half-right. Camp one, autoregressive drafters: the junior writes one word at a time, each seeing the ones before it. Coherent — but sequential, so it has to stay shallow and short. Camp two, parallel drafters: fire out the whole block at once, every position guessing independently, nobody seeing what their neighbors picked. Fast, can be deep — but the guesses don't coordinate.

3:39Tyler: Uncoordinated how — like the words just don't fit together?

3:43Cassidy: Exactly that. The paper's own example: say the next two words could be "of course" or "no problem," both fine. Position one, hedging across the options, picks "of." Position two, independently, picks "problem." Each made a reasonable local call — and you get "of problem." Nonsense, because the two positions never agreed on which phrase they were building. That's multi-modal collision, and it's why a parallel drafter's accuracy crumbles after the first couple of tokens.
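
The "of problem" collision can be put into numbers with a small sketch. The 50/50 split between the two phrases is an assumption chosen for illustration; the episode only says both phrases are fine.

```python
from itertools import product

# Assumed toy target: two equally likely two-word phrases.
joint = {("of", "course"): 0.5, ("no", "problem"): 0.5}


def marginal(joint, position):
    """Probability of each word at one position, ignoring the other position."""
    out = {}
    for phrase, prob in joint.items():
        word = phrase[position]
        out[word] = out.get(word, 0.0) + prob
    return out


first, second = marginal(joint, 0), marginal(joint, 1)

# A purely parallel drafter samples each position from its own marginal,
# independently, so the probability of a pair is the product of marginals.
independent = {(a, b): first[a] * second[b] for a, b in product(first, second)}

# Mass that lands on phrases the target would never produce.
incoherent_mass = sum(prob for phrase, prob in independent.items() if phrase not in joint)
```

Under these assumed numbers, a quarter of independent draws come out as "of problem" and half of all draws are phrases the target never writes, even though each position's marginal is exactly right.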

4:16Tyler: So the obvious scoreboard says autoregressive wins — coherent beats incoherent. Slower, sure, but at least it isn't emitting "of problem."

4:26Cassidy: That's the conventional wisdom. And when DeepSeek actually measured it, the parallel drafter won. That result is the hinge the whole architecture turns on — and it took a new way of measuring to even see it. Here's the measurement they invented — call it position-wise conditional acceptance. For each spot in the block, ask: given that every word before it was already accepted, how often does the word at this position survive the expert's check? That separates how good the drafter is at position four from the fact that positions one through three might've already tanked.
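
One way to compute that measurement from logs is sketched below. It assumes the only thing recorded per draft block is how many leading tokens were accepted; the function name and the sample log are hypothetical, and the paper's exact estimator may differ.

```python
def conditional_acceptance(accepted_lengths, block_size):
    """Position-wise conditional acceptance rates.

    accepted_lengths[i] is the number of leading draft tokens accepted in block i.
    Entry k of the result is: among blocks where positions 0..k-1 were all
    accepted, the fraction where position k was accepted as well.
    """
    rates = []
    for k in range(block_size):
        reached = sum(1 for n in accepted_lengths if n >= k)
        survived = sum(1 for n in accepted_lengths if n >= k + 1)
        rates.append(survived / reached if reached else None)
    return rates


# Hypothetical log of eight blocks, each four draft tokens long.
log = [4, 2, 0, 1, 4, 3, 0, 2]
rates = conditional_acceptance(log, block_size=4)
```

Conditioning on the prefix is what separates "position four is hard" from "positions one to three already failed": a block that died at position one never counts against position four.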

5:07Tyler: And on screen, this chart is the heart of the paper — acceptance against position. Walk through what the two curves actually do.

5:16Cassidy: The parallel drafter starts way up high at position one — much higher than the autoregressive one — because it can afford to be deep and smart there. Then it decays fast; by position four it's rotting. The autoregressive drafter starts lower but stays flat, even drifts up. Two very different shapes: a tall cliff versus a low plateau.

5:39Tyler: And the cliff wins because of the relay-race logic — it's a strict prefix.

5:44Cassidy: That's the key. Speculative decoding accepts the block as a prefix — the instant one token is rejected, everything after it gets thrown away, however good it was. It's a relay where if the first runner drops the baton, all the downstream effort is wasted. So position one carries enormous leverage. A drafter that's excellent at position one and mediocre later beats one that's evenly okay — because that first leg decides whether you get anything at all.
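
The relay-race logic turns into simple arithmetic: with prefix acceptance, the expected number of accepted tokens is the sum of running products of the per-position rates. The two curves below are invented numbers shaped like the "tall cliff" and "low plateau" described in the episode, not figures from the paper.

```python
def expected_accepted_length(conditional_rates):
    """Expected accepted draft tokens when acceptance stops at the first rejection."""
    expected, alive = 0.0, 1.0
    for rate in conditional_rates:
        alive *= rate       # chance the prefix survives through this position
        expected += alive   # this position counts only if the prefix survived
    return expected


# Hypothetical curves: strong start that decays vs. a flat, lower one.
cliff = [0.9, 0.7, 0.5, 0.3]
plateau = [0.6, 0.6, 0.6, 0.6]

cliff_len = expected_accepted_length(cliff)
plateau_len = expected_accepted_length(plateau)
```

Position one multiplies into every later term, so the cliff's high first rate lifts the whole sum, while the plateau's better positions three and four are discounted by everything that had to survive before them.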

6:16Tyler: So the parallel model's first-token firepower dominates even though its tail rots. The fix writes itself — keep the deep parallel backbone for that position-one strength, and find some cheap way to stop the tail from rotting.

6:32Cassidy: And "cheap" is the whole game. That's the semi-autoregressive design. Keep the expensive work fully parallel — the backbone spits out a first-draft set of probabilities for every position in one shot, capturing everything except coordination between words. Then a tiny sequential head walks left to right and adds a correction at each position based on what was actually just sampled. Back to "of": the head's job is to learn that once position one landed on "of," position two should lean "course," not "problem."

7:05Tyler: So it's the two people finishing each other's sentences — except now person two gets to glance at the word person one actually wrote before committing.

7:15Cassidy: Precisely. And the cheap version — the Markov head — is basically a lookup table: given the previous token, here's a little vector of nudges to re-weight what comes next. They keep it low-rank, so even with a hundred-thousand-word vocabulary it stays tiny. Which matters, because if the sequential part got expensive, you'd have thrown away the whole reason you went parallel.
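
A minimal sketch of a low-rank lookup correction of this kind, under stated assumptions: the four-word vocabulary, the rank-1 factors `U` and `V`, and their hand-set values are all made up for illustration, and the episode does not give DSpark's actual parameterization.

```python
import math

VOCAB = ["of", "no", "course", "problem"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def markov_correction(prev_id, U, V):
    """Low-rank lookup: row U[prev_id] (length r) times V (r x vocab)."""
    rank, vocab = len(V), len(V[0])
    return [sum(U[prev_id][j] * V[j][t] for j in range(rank)) for t in range(vocab)]


def corrected_probs(base_logits, prev_id, U, V):
    """Parallel-backbone logits plus a nudge that depends only on the previous token."""
    delta = markov_correction(prev_id, U, V)
    return softmax([b + d for b, d in zip(base_logits, delta)])


# Hand-set rank-1 factors (hypothetical): "of" maps to +1, "no" maps to -1,
# and a positive value pushes "course" up and "problem" down.
U = [[1.0], [-1.0], [0.0], [0.0]]
V = [[0.0, 0.0, 2.0, -2.0]]

# The backbone alone sees a tie between "course" and "problem" at position two.
base_position_two = [0.0, 0.0, 1.0, 1.0]
after_of = corrected_probs(base_position_two, VOCAB.index("of"), U, V)
after_no = corrected_probs(base_position_two, VOCAB.index("no"), U, V)
```

Storing `U` and `V` costs on the order of 2 x vocab x rank numbers instead of vocab x vocab, which is why the table stays small even for a large vocabulary. Each position still ends in an ordinary softmax, so an exact per-token probability can be read off directly.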

7:39Tyler: And there's a constraint that killed earlier attempts at this, right? People have tried to fix parallel incoherence for years.

7:47Cassidy: They have, and most of those fixes are illegal here. Old parallel-coherence tricks use global normalization or marginalize over hidden variables — and that wrecks your ability to read off an exact probability for each individual token. But the rejection-sampling rule needs exact per-token probabilities or the lossless guarantee collapses. DSpark's trick is staying strictly local — each correction depends only on the actual previous token — so clean softmax probabilities survive. It's shaped as much by what's mathematically permitted as by what's smart.

8:23Tyler: So how much does this sliver of sequencing actually buy you?

8:27Cassidy: A two-layer DSpark beats a five-layer pure-parallel drafter across every domain they tested. A little token-to-token memory does more than stacking parallel depth. Against the autoregressive Eagle3, accepted length climbs around twenty-seven to thirty-one percent; against the parallel backbone it builds on, sixteen to eighteen. And the overhead is almost nothing — pushing the draft from four tokens out to sixteen adds between a fifth of a percent and just over one percent to the round-trip, while delivering up to thirty percent more accepted tokens.

9:05Tyler: Because the expert's verification pass dominates the compute anyway, so the little head rides for free.

9:12Cassidy: Right. That's act one — draft better. But there's a second bottleneck, and it's the one that actually killed their last production system.

9:22Tyler: Worth flagging one thing about those numbers before we move on, though. Every figure Cassidy just gave — the accepted-length gains — comes from tests with the scheduler switched off, on Qwen and Gemma models, to isolate raw draft quality. The sixty-to-eighty-five-percent production win comes from the full system on DeepSeek-V4. Hold onto that gap. No single experiment shows the whole machine end to end — and that comes back to bite later. Here's the formula that organizes everything. The latency for each token you actually keep is the draft time plus the verify time, divided by how many tokens got accepted. Three knobs: draft faster, draft better — that's the bottom of the fraction — or make verification cheaper. Act one pushed on "draft better." Act two goes after the verify term, and that's where this turns into a systems paper.
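
The formula can be checked with a quick calculation. The millisecond values below are invented; only the relative changes (about one percent more round-trip time, up to thirty percent more accepted tokens) come from the figures quoted earlier in the episode.

```python
def latency_per_accepted_token(draft_ms, verify_ms, accepted_tokens):
    """Time spent per token that is actually kept."""
    return (draft_ms + verify_ms) / accepted_tokens


# Hypothetical baseline: 20 ms round trip, 3 tokens kept per round.
base = latency_per_accepted_token(draft_ms=1.0, verify_ms=19.0, accepted_tokens=3.0)

# Longer draft: round trip grows 1% (20.0 -> 20.2 ms), kept tokens grow 30%.
longer = latency_per_accepted_token(draft_ms=1.2, verify_ms=19.0, accepted_tokens=3.9)

ratio = longer / base  # equals 1.01 / 1.30
```

Because the verify term dominates the numerator, a small increase in draft cost barely moves the top of the fraction, while the gain in accepted tokens divides the whole thing.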

10:21Cassidy: And the naive move would just be — verify everything the drafter proposes. Longer drafts, more free tokens.

10:28Tyler: That's exactly the move that blew up on them. Verifying a token isn't free — it occupies batch capacity. Under light load, checking an extra speculative token costs basically nothing; the GPU's got room. Under heavy load, that extra check is capacity you just stole from another user in the queue. So a long, aggressive draft that's a clear win for one person can drag down throughput for the whole crowd.

10:56Cassidy: That's the throughput-versus-latency tension. Throughput is how many people you serve at once; per-user latency is how fast your own answer feels. Long drafts buy the second by spending the first.

11:10Tyler: And that's why DeepSeek's previous production setup was the most timid thing possible — a single-token drafter. They tried more aggressive static drafters and shelved them, because at high concurrency they strictly hurt throughput. They left speed on the table on purpose. DSpark's second idea is to stop treating draft length as a fixed dial and make it a live, load-aware decision.

11:36Cassidy: So picture express lanes that open and close with traffic. Quiet server — verify five or six tokens per user, nearly free. Server slammed with concurrent requests — shrink the budget back so those checks don't rob everyone else. Same instinct as opening highway lanes off-peak and closing them at rush hour.

11:57Tyler: And to make that call it needs two things — a guess at how likely each draft token is to survive, and a read on how busy the hardware is. The first comes from a confidence head: one small layer that outputs, per position, the chance that token survives, given all the earlier ones did. Those estimates have a known flaw, though — neural nets rank well but run overconfident. The head can correctly tell you token A beats token B while claiming ninety-five percent when the truth is eighty. Raw, it was a good ranker — discrimination in the low-to-high eighties — but its calibration error ran three to eight percent. So they tune it with a single dial, temperature scaling, that softens the inflated numbers until they match reality, dropping the error to about one percent. The scheduler has to trust the magnitudes, not just the order.
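
Temperature scaling for a binary "will this token survive" head can be sketched as a one-parameter fit. Everything here is an assumption for illustration: the sigmoid head, the grid search, and the synthetic data in which the true survival rate is the raw logit passed through a temperature of 2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(logits, targets, temperature):
    """Mean binary cross-entropy of sigmoid(logit / T) against observed rates."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, targets, grid=None):
    """Pick the single temperature that minimizes held-out NLL (simple grid search)."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    return min(grid, key=lambda t: nll(logits, targets, t))


# Synthetic, overconfident head: the true survival rate is sigmoid(logit / 2).
logits = [-4.0, -2.0, -1.0, 0.5, 1.0, 2.0, 3.0, 4.0]
true_rates = [sigmoid(z / 2.0) for z in logits]

T = fit_temperature(logits, true_rates)
raw_claim = sigmoid(3.0)          # what the uncalibrated head reports
calibrated_claim = sigmoid(3.0 / T)
```

Dividing by one temperature never changes which token ranks higher, so the head's discrimination is untouched; only the magnitudes move toward the observed rates, which is what the scheduler needs.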

12:55Cassidy: Now the scheduler has trustworthy survival odds for every candidate token across every user in the batch. Deciding how far to verify each one looks like a brutal combinatorial problem, but it collapses into something almost embarrassingly simple. And the reason it collapses hides a correctness trap that, when they fixed it in production, accidentally made the theory more correct, not less. That's the next stretch, the densest part, and it pays off in that twist. Start with why it's simple. Extending any draft can only lower its cumulative survival odds: each extra token is one more chance to fail, multiplied onto the running product. So the value of adding any one token is just that token's cumulative survival probability, the chance that it and everything before it in its draft gets accepted. And because that number only shrinks along a draft, you don't have to reason about whole blocks. Throw every candidate token from every user into one pile, sort by cumulative survival probability, and admit from the top down, like boarding the passengers most certain to actually fly first, until adding the next one would slow the plane enough to hurt total throughput.

14:08Tyler: And "hurt throughput" isn't a guess — they profile the engine once at startup, a little table of how fast it runs at each batch size, and read the tradeoff straight off it. Total throughput is expected accepted tokens times steps-per-second at the current load. Greedily admit, stop when that product peaks. But here's the trap Cassidy flagged. Speculative decoding has to be non-anticipating — your decision to verify token k can't depend on token k's own value. Peek at the token to decide whether to check it, and you've leaked information that breaks the lossless guarantee.
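
The pile-sort-admit procedure can be sketched as below. This is a simplified reading of the description in the episode, not the deployed scheduler: the function name, the assumption that every request always gets one guaranteed token per step, and the profile table are all hypothetical. It also ignores the non-anticipation issue raised in the same turn.

```python
def schedule_verification(confidences, steps_per_sec):
    """Greedy load-aware admission of draft tokens.

    confidences[r]: calibrated conditional survival odds per draft position of request r.
    steps_per_sec(batch_tokens): profiled engine speed at that batch size.
    Returns {request index: number of draft tokens admitted for verification}.
    """
    candidates = []
    for req, probs in enumerate(confidences):
        alive = 1.0
        for p in probs:
            alive *= p  # cumulative chance this token and its prefix are accepted
            candidates.append((alive, req))
    candidates.sort(key=lambda c: -c[0])  # most likely survivors first

    budgets = {req: 0 for req in range(len(confidences))}
    batch_tokens = len(confidences)    # assumption: one guaranteed token per request
    expected = float(len(confidences))
    best = expected * steps_per_sec(batch_tokens)
    for alive, req in candidates:
        trial = (expected + alive) * steps_per_sec(batch_tokens + 1)
        if trial <= best:
            break  # the next token would cost more speed than it is expected to return
        budgets[req] += 1
        batch_tokens += 1
        expected += alive
        best = trial
    return budgets


# Hypothetical startup profile: steps per second at each batch size.
profile = {2: 100.0, 3: 100.0, 4: 98.0, 5: 90.0, 6: 75.0, 7: 60.0}
budgets = schedule_verification([[0.9, 0.8, 0.5], [0.6, 0.3]], lambda b: profile[b])
```

With this made-up profile the first request gets two draft tokens verified and the second gets one; the fourth candidate is refused because the drop from 90 to 75 steps per second outweighs its 0.36 expected contribution. Since cumulative odds never rise along a draft, the sort cannot admit a later position of a request before an earlier one.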

14:50Cassidy: And the bet analogy is the cleanest way in. Judging whether to make a bet by peeking at whether it happened to win — instead of the odds when you placed it — is cheating. The confidence head computes the next survival odds using the token that was just sampled. So a retrospective sort across the whole batch could let a token's own outcome sneak into whether it gets admitted. The appendix has a clean case where that bias drags a true seventy-thirty output distribution to eighty-five-fifteen. No longer lossless.
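
A toy calculation shows how peeking skews the output. This is not the appendix example from the paper; it is a deliberately extreme case built on assumptions: a perfect drafter (so any verified draft is accepted), a two-token vocabulary, and a rule that drops the draft whenever it happens to be "B", after which the target model samples that position afresh.

```python
def output_distribution(p, drop_if):
    """Exact distribution of the emitted token under a given admission rule.

    p: target distribution, also used as the drafter (perfect drafter).
    drop_if(draft): True if the draft token is not sent for verification,
    in which case the position is resampled from p.
    """
    out = {t: 0.0 for t in p}
    for draft, prob in p.items():
        if drop_if(draft):
            for t in p:
                out[t] += prob * p[t]  # discarded draft, fresh sample from the target
        else:
            out[draft] += prob          # verified and accepted
    return out


p = {"A": 0.7, "B": 0.3}

# Non-anticipating rules: the decision ignores the draft token's value.
always_verify = output_distribution(p, drop_if=lambda draft: False)
never_verify = output_distribution(p, drop_if=lambda draft: True)

# Anticipating rule: the decision looks at the token it is deciding about.
peeking = output_distribution(p, drop_if=lambda draft: draft == "B")
```

Both value-blind rules return the target's 70/30 split, while the peeking rule gives "B" a second chance to turn into "A" and drifts to 91/9. The size of the shift depends on the made-up rule; the direction of the problem is the point.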

15:28Tyler: In the clean algorithm the early-stop break saves you — it halts before it ever evaluates anything that depends on the peeked-at token. Causality preserved. Then they deployed it, and reality was messier. The clean argument assumes a smooth capacity curve; real GPU capacity is jagged and step-wise. And worse, the scheduler needs the next batch size before the current step even finishes — a hard constraint of their zero-overhead scheduling. Their fix: approximate the load using confidence outputs from two steps earlier. Stale data.

16:08Cassidy: And here's the twist — the stale data fixes the causality problem for free.

16:14Tyler: It does. Because the admission decision now depends only on already-finished history — outcomes fully realized two steps ago — a token's own value can never leak into the choice to admit it. The shortcut everyone would assume degrades correctness is the exact thing that restores it. A practical compromise that makes the theory cleaner, not dirtier.
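
The "two steps stale" idea can be sketched as a small buffer that only ever hands the scheduler information from steps that have already finished. The class name, the `lag` parameter, and the shape of the per-step summary are hypothetical; the episode does not describe the production data structures.

```python
from collections import deque


class StaleLoadEstimator:
    """Serve the scheduler a summary recorded `lag` steps ago."""

    def __init__(self, lag=2):
        self.lag = lag
        self.history = deque(maxlen=lag)  # keeps only the last `lag` summaries

    def estimate_for_next_step(self):
        """Return the summary from `lag` steps back, or None while warming up."""
        if len(self.history) < self.lag:
            return None
        return self.history[0]

    def record(self, step_summary):
        """Store the confidence summary of a step once it has completed."""
        self.history.append(step_summary)


est = StaleLoadEstimator(lag=2)
seen = []
for step in range(5):
    seen.append(est.estimate_for_next_step())  # decide using old data only
    est.record({"step": step})                 # then log this step's outputs
```

In this sketch the decision at step t reads the summary from step t - 2, so nothing sampled in the current step can influence whether its own tokens are admitted.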

16:39Cassidy: So where we are: act one made the draft coherent without paying the autoregressive tax, act two spends verification budget only when the hardware has it spare. Put both in front of real traffic — what happens?

16:54Tyler: If the load-aware story is right, you'd predict the verification budget should breathe with traffic — wide when it's quiet, tight when it's busy. And that's exactly what the telemetry shows: from the old static two tokens out to four-to-six per request under light load, smoothly shrinking as concurrency climbs. The per-user payoff — sixty to eighty-five percent faster generation on V4-Flash, fifty-seven to seventy-eight on V4-Pro, at matched throughput against the single-token baseline they replaced.

17:31Cassidy: And the reason adaptivity earns its keep: how many tokens you can safely draft swings hard by task. Structured work like math or code — the drafter nails long runs, around five and a half accepted tokens. Open-ended chat is choppier, closer to three and a half. On chat especially, turning up the confidence threshold to prune the doomed tail tokens lifts the acceptance rate from about forty-six percent to ninety-six. A static budget can't track that swing; a load-and-confidence-aware one can.
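
Why pruning the tail raises the acceptance rate can be seen in a short sketch. It assumes the threshold is applied to cumulative survival odds, which the episode does not specify, and the confidence values are invented.

```python
def cumulative_survival(conditional_confs):
    alive, out = 1.0, []
    for conf in conditional_confs:
        alive *= conf
        out.append(alive)
    return out


def prune_tail(conditional_confs, threshold):
    """Number of leading draft tokens whose cumulative survival odds are >= threshold."""
    keep = 0
    for alive in cumulative_survival(conditional_confs):
        if alive < threshold:
            break
        keep += 1
    return keep


def expected_acceptance_rate(conditional_confs, keep):
    """Expected accepted draft tokens divided by draft tokens sent for verification."""
    if keep == 0:
        return None
    return sum(cumulative_survival(conditional_confs)[:keep]) / keep


# Hypothetical chat-style draft: confident start, shaky tail.
confs = [0.95, 0.9, 0.6, 0.4]
full_rate = expected_acceptance_rate(confs, keep=len(confs))
kept = prune_tail(confs, threshold=0.8)
pruned_rate = expected_acceptance_rate(confs, keep=kept)
```

Here the unpruned block is expected to have about 63% of its verified tokens accepted, and pruning at 0.8 keeps two tokens with an expected rate of about 90%. Fewer tokens are accepted in absolute terms, but far less verification capacity is spent on tokens that were likely to be thrown away.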

18:07Tyler: The cleanest way to see the whole thing is the Pareto frontier — that menu of best tradeoffs. Throughput on one axis, per-user speed on the other; every operating point you could pick sits under one curve. DSpark's curve sits outward and to the right of the baseline's — strictly more speed and more users on the same hardware. That's the honest headline: a better menu, not one magic multiplier. And I want to flag a number the paper itself disowns. At the very strictest latency targets, the charts show throughput gains like four hundred and six percent, even six hundred and sixty-one. Read as a seven-times speedup, that's wrong — and the authors say so. What's actually happening is the old baseline falls off a cliff at those strict targets — it can barely sustain a tiny batch — so the ratio explodes against a number near zero. The real claim is the cliff itself: DSpark keeps running where the old system simply couldn't. And that honesty is the right note to get critical on, because the paper has a real soft spot. Remember the gap I flagged — the offline quality numbers run with the scheduler off, on Qwen and Gemma; the big production numbers run the full system on V4. No single experiment isolates the scheduler's own contribution on reproducible, public models against a strong competitor. You're asked to trust that the two halves compose, and that link is asserted more than it's shown.
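
The "better menu" claim is a statement about Pareto frontiers, which can be computed with a few lines. The operating points below are invented to show the shape of the comparison and are not measurements from the paper.

```python
def pareto_frontier(points):
    """Keep the (throughput, per_user_speed) points that no other point matches or beats on both axes."""
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1]
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)


# Hypothetical operating points: (throughput, per-user tokens per second).
baseline = [(100, 20), (80, 30), (50, 35)]
candidate = [(100, 32), (80, 45), (50, 55)]

combined = pareto_frontier(baseline + candidate)
```

In this made-up example every baseline point is dominated, so the combined frontier consists only of the candidate's points. That is what "outward and to the right" means: at any throughput you choose, there is an operating point with more per-user speed, and the reverse.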

19:47Cassidy: That's fair — though the production deployment is unusually real evidence. This isn't a benchmark with one request at a time; it's serving live traffic at scale.

19:58Tyler: It is, and that counts for a lot. But look at what it's compared against — a single-token drafter, the most cautious setup possible. They admit they never deployed an aggressive static multi-token drafter because it hurt throughput. So part of DSpark's win is over a deliberately timid baseline, not a well-tuned aggressive one. And the deployed scheduler isn't even the clean algorithm — it drops the early-stop, uses stale estimates, and runs on that jagged capacity curve the optimality argument assumed away. So the thing in production is a heuristic whose optimality is shown empirically, not guaranteed. It clearly works. But "it works in our datacenter" and "we proved it's near-optimal" are different claims, and the paper sometimes lets them blur.

20:49Cassidy: I'll grant the blur, Tyler. What I think they've earned cleanly is the reframe — and there's one more thing they were honest about: the RNN head. They built a fancier sequential head too, one that carries memory of the whole prefix instead of just the previous token. It was pitched as the more powerful option, and it barely beat the simple lookup table — mostly at long blocks — so they dropped it from deployment. Honest to report, but it means one of their two architectural bets didn't earn its place. But step back to the durable idea, because it outlives this one system. Two reframes. First, the place draft-model capacity pays off most is position one — first-token leverage — and a sliver of token-to-token memory is enough to save the rest. Second, and bigger: how aggressively to verify isn't a setting you tune once and forget. It's a live decision the server should keep remaking as load changes. That second idea is the one other inference systems will borrow, whatever happens to DSpark itself.

21:56Tyler: And that points at the one cost DSpark still can't dodge. It always pays to draft that first block through the parallel backbone — and for a genuinely hard query where almost nothing gets accepted, that compute is just burned. So here's the split I'd put to you. Is the future more of this — serving systems getting ever smarter about spending a fixed draft budget across a crowd? Or does the real next gain come from refusing to pay that upfront draft cost at all, on the queries a model can already tell are hopeless? If you've run inference at scale, you probably already lean one way — say which, and why.

22:41Cassidy: If you want to go deeper, the full annotated version of this episode is on paperdive.ai — every term tap-to-define, with links to the related papers grouped by theme, plus the weekly and monthly roundups.

22:57Tyler: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Cassidy and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is DSpark, out today — June 27th, 2026 — from DeepSeek and Peking University. The trick was never just drafting more. It's knowing when the road's clear enough to open the extra lane. See you in the next one.