Definition
The model's short-term memory of everything it has read so far in the current conversation.
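The attention keys and values computed for previous tokens, stored so that a transformer can reuse them during generation instead of recomputing them at every step. The cache grows linearly with context length (and with batch size), and at long contexts it often dominates inference memory.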
Also called: KV-cache, key-value cache, KV caches
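
The linear growth described in the definition can be made concrete with a rough size estimate. The sketch below assumes a standard decoder-only transformer that stores one key vector and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model dimensions in the example are illustrative assumptions, not the configuration of any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # assumption: 16-bit storage (fp16 / bf16)
) -> int:
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_element
    )


# Illustrative dimensions only: 32 layers, 32 KV heads, head size 128.
size_4k = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
size_8k = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)

print(f"4096 tokens: {size_4k / 1024**3:.1f} GiB")
print(f"8192 tokens: {size_8k / 1024**3:.1f} GiB")
```

Under these assumptions, doubling the context length doubles the cache size. Techniques that reduce the number of KV heads or the bytes per element shrink every term in proportion.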