Concept · 4 episode(s)

KV Cache

← all concepts

Definition

The KV cache stores the key and value tensors computed during transformer self-attention so they don’t have to be recomputed for every new token. It’s why autoregressive generation is fast enough to be useful, and managing it is half of modern LLM serving.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.