speculative decoding

Definition

Plain language

A speed trick where a small, fast draft model guesses the next several tokens and a big model checks all of those guesses in one parallel pass.

As stated in the literature

An inference acceleration technique where a draft model proposes a sequence of tokens that a larger target model verifies in a batched forward pass.

Also called: speculative decoder, predicted outputs

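Why it matters: It cuts inference latency substantially without changing what the large target model would have produced on its own, because every drafted token is checked against the target model before it is kept. That makes it one of the most popular tricks for serving large models cheaply.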

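For example, a tiny draft model proposes the next eight tokens of an answer, and the big model checks them all in a single forward pass. It keeps the drafted tokens up to the first one that differs from its own prediction, substitutes its own token at that position, and discards the rest.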
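The draft-then-verify round described above can be sketched in a few lines of Python. This is a minimal illustration of the greedy variant only, not any particular library's implementation: the two "models" are toy functions that map a token list to a next token, and every name here (`speculative_step`, `generate`, `toy_target`, `toy_draft`) is hypothetical. Production systems that sample rather than decode greedily typically use a probabilistic accept/reject rule instead of the exact-match check shown here.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The draft model proposes k tokens, one after another (cheap).
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target model checks every position. A real system scores all
    #    positions in one forward pass; this loop only imitates that.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(context + proposal[:i])
        if proposal[i] != expected:
            # First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposal[i])

    # 3. Every guess matched, so the same pass also yields one extra token.
    accepted.append(target(context + proposal))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], n_tokens: int, k: int = 4) -> List[Token]:
    """Repeat speculative rounds until n_tokens new tokens exist."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_tokens]


def plain_greedy(target: NextToken, prompt: List[Token],
                 n_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


# Toy models (assumptions for illustration): the target counts upward
# modulo 10; the draft agrees except that it guesses 0 after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [3], k=4))  # [4, 5]
    print(generate(toy_target, toy_draft, [3], n_tokens=12))
```

In the first call the draft proposes 4, 0, 1, 2; the target agrees with 4, disagrees with 0, and supplies 5 instead, so the round yields two tokens from a single verification pass. However often the draft is wrong, `generate` returns the same sequence as `plain_greedy`, which is the sense in which the output is unchanged.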

Heard on the show

“The fix everybody already uses is speculative decoding.”

Episode 179 — How DeepSeek Made One User Faster Without Slowing Down the Crowd

Mentioned in 5 episodes
