Concept · 8 episodes

KV Cache

Definition

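The KV cache stores the key and value tensors computed for earlier tokens during transformer self-attention, so each new token only needs its own key and value computed instead of recomputing them for the whole sequence. It's why autoregressive generation is fast enough to be useful. Because the cache grows with every token of context and every concurrent request, managing its memory is a large part of modern LLM serving.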
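To make the definition concrete, here is a minimal sketch of one attention head decoding with and without a cache. It is an illustration under simplifying assumptions, not how any particular serving stack implements it: a single head, no batching, plain Python lists instead of tensors, and hypothetical names (`KVCache`, `step_cached`, `step_uncached`) chosen for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Hypothetical container: one key and one value vector per past token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def step_cached(x, Wq, Wk, Wv, cache):
    # Project only the newest token, then attend over everything cached.
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(matvec(Wq, x), cache.keys, cache.values)

def step_uncached(xs, Wq, Wk, Wv):
    # Without a cache, every step re-projects the entire prefix.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)
```

Both functions return the same attention output for the newest token. The cached version does one key projection and one value projection per step, while the uncached version does one pair per token in the prefix. The price is memory: the cache holds one key and one value vector per token, and a real model keeps that per layer and per head.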

Episodes covering this

179
How DeepSeek Made One User Faster Without Slowing Down the Crowd
DSpark: Confidence-Scheduled Speculative Decoding with
Xin Cheng, Xingkai Yu, Chenze Shao et al. · Peking University / DeepSeek-AI·23 min·Jun 27, 2026
116
Why Streaming Half a Reasoning Chain Beats Sending the Whole Thing
Streaming Communication in Multi-Agent Reasoning
Yang, Xu, Wang et al. · HKUST (GZ)·26 min·Jun 04, 2026
096
How Treating an AI Agent's Execution Like Git Recovers a Coordination Penalty
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Yu, Chong, Nandi et al. · Northeastern University·22 min·May 28, 2026
085
Why Long-Context Models Might Need Compute, Not Capacity, Before Eviction
Language Models Need Sleep
Lee, McLeish, Goldstein et al. · Carnegie Mellon University·24 min·May 26, 2026
036
Sparse Attention Was the Wrong Frame. Treat It as Geometry Instead.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Dehghankar, Asudeh · University of Illinois Chicago·24 min·May 11, 2026
033
Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Sridhar, Johansen · California·24 min·May 11, 2026
027
When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Kamahori, Li, Peter et al. · University of Washington·30 min·May 08, 2026
016
Why Your Coding Agent Stalls While the GPU Runs Hot
MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems
Wang, Ye, Xu et al. · Duke University·24 min·May 03, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.