Definition
The KV cache stores the key and value tensors computed during transformer self-attention so they don’t have to be recomputed for every new token. It’s why autoregressive generation is fast enough to be useful, and managing it is half of modern LLM serving.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Zoology: Measuring and Improving Recall in Efficient Language Models
- SnapKV: LLM Knows What You are Looking for Before Generation