Concept · 4 episodes

Speculative Decoding

Definition

Speculative decoding uses a small, fast “draft” model to propose several next tokens at once, which the large “target” model then verifies in a single forward pass. Each accepted token skips a separate decoding step on the target model; at the first rejected token, the target model supplies the replacement and drafting resumes from there. It’s one of the biggest practical latency wins in modern LLM serving.
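
The propose-then-verify loop can be sketched in a few lines. This is a minimal illustration of the greedy variant only, not any particular system's implementation. The toy `draft_model` and `target_model` functions, the token arithmetic, and the parameter `k` are all hypothetical stand-ins. In a real server the target model scores all drafted positions in one batched forward pass, while here that pass is simulated with a loop. Sampling-based variants replace the equality check with a rejection-sampling rule so the output distribution still matches the target model.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Toy stand-in for the large model: next token is (sum of prefix) mod 7.
    return sum(prefix) % 7


def draft_model(prefix: List[int]) -> int:
    # Toy stand-in for the small model: agrees with the target except
    # when the prefix length is a multiple of 4, where it guesses wrong.
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model predicts the next token at each of the k + 1 positions.
    #    (One batched forward pass in a real system; simulated with a loop here.)
    target_preds = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest run of drafted tokens that matches the target.
    accepted = 0
    while accepted < k and proposed[accepted] == target_preds[accepted]:
        accepted += 1

    # 4. Emit the accepted tokens plus the target's own token at the first
    #    mismatch (or one extra token if every draft was accepted).
    return proposed[:accepted] + [target_preds[accepted]]


def speculative_decode(prompt: List[int], draft: Model, target: Model,
                       k: int, max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(tokens, draft, target, k))
    return tokens[:len(prompt) + max_new_tokens]


def greedy_decode(prompt: List[int], target: Model, max_new_tokens: int) -> List[int]:
    # Baseline: one target-model call per generated token.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens
```

Under these assumptions, `speculative_decode([1, 2, 3], draft_model, target_model, k=4, max_new_tokens=10)` returns the same sequence as `greedy_decode([1, 2, 3], target_model, 10)`. Each step yields at least one token, and up to `k + 1` when the draft model guesses well, which is where the speedup comes from.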

Episodes covering this

179
How DeepSeek Made One User Faster Without Slowing Down the Crowd
DSpark: Confidence-Scheduled Speculative Decoding with
Xin Cheng, Xingkai Yu, Chenze Shao et al. · Peking University / DeepSeek-AI·23 min·Jun 27, 2026
068
The OS Trick That Makes Tree Search Practical for Coding Agents
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
Dong, He, Hou et al. · Institute of Parallel and Distributed Systems·27 min·May 22, 2026
036
Sparse Attention Was the Wrong Frame. Treat It as Geometry Instead.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Dehghankar, Asudeh · University of Illinois Chicago·24 min·May 11, 2026
027
When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Kamahori, Li, Peter et al. · University of Washington·30 min·May 08, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.