KV cache

Definition

Plain language

The model's short-term memory of everything it has read so far in the current conversation.

As stated in the literature

The stored attention keys and values from previous tokens, which a transformer reuses during generation instead of recomputing them. The cache grows linearly with context length and often dominates inference memory.
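To make the linear growth concrete, here is a rough back-of-the-envelope estimator. It is a sketch that assumes a standard decoder-only transformer storing one key tensor and one value tensor per layer; the function name `kv_cache_bytes` and the model configuration in the example are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes (assumes plain per-layer K and V tensors)."""
    # The leading 2 counts one tensor for keys and one for values in each layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration: 32 layers, 32 KV heads, head dimension 128,
# 16-bit (2-byte) values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 1024, "KiB per token")       # 512.0 KiB per token
print(at_4k / 1024**3, "GiB at 4,096 tokens")  # 2.0 GiB at 4,096 tokens
```

Doubling `seq_len` doubles the result, which is why long contexts can push the cache past the size of the model weights themselves. Real systems may shrink these numbers with techniques such as sharing KV heads across query heads or quantizing the cache.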

Also called: KV-cache, key-value cache, KV caches

Why it matters: Managing the KV cache is often the single biggest factor in what long-context inference costs, in both memory and dollars.

For example, when generating the 1000th token of a response, the model reuses the cached keys and values from the previous 999 tokens instead of recomputing them.
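The reuse described above can be sketched with a toy single-head attention step. This is a minimal illustration, not a real model: the `KVCache` class, the `decode_step` function, and the fixed scaling used in `project` (a stand-in for learned projection matrices) are all assumptions made for the example.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Hypothetical cache: one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


def project(x, factor):
    # Toy stand-in for a learned projection matrix (an assumption of this sketch).
    return [factor * xi for xi in x]


def decode_step(x, cache):
    """Process one new token: compute only its own K and V, reuse the rest."""
    query, key, value = project(x, 1.0), project(x, 0.5), project(x, 2.0)
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
outputs = [decode_step(t, cache) for t in tokens]
print(len(cache))  # 3
```

Each call to `decode_step` computes a key and a value for the new token only; everything earlier comes from the cache. Without the cache, every step would have to re-project all previous tokens, which is the recomputation the example sentence refers to.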

Heard on the show

“When a transformer reads a sequence of tokens, it keeps a record of every token in something called the KV cache.”

Episode 085 — Why Long-Context Models Might Need Compute, Not Capacity, Before Eviction

Mentioned in 6 episodes
