Literature review · 6 episode(s)

Efficient architectures and long-context


Rethinking the KV cache

Two papers attack the KV cache from different angles. Reframing as solvable from running turns the cache into an implementation choice: a fixed-size state thousands of times smaller hits 100% on benchmarks where pure score 3%, and gets more accurate with longer sequences E033. Separately, treating as a geometric range-search problem (rather than importing nearest-neighbor tricks from web search, a category error) yields a method faster than with over 99.9% , that occasionally beats dense via a denoising effect E036. Both argue the field imported the wrong toolkit and that the geometry of attention admits better algorithms.
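To make the "fixed-size state instead of a growing cache" idea concrete, here is a minimal sketch that uses kernelized linear attention as a stand-in. This is an assumption made for illustration, not the method of E033: the feature map, the function names, and the dimensions are all hypothetical. It shows only that a causal attention variant can be computed exactly from running sums whose size does not depend on sequence length.

```python
import math
import random

def feature_map(x):
    # Positive feature map elu(x) + 1, a common choice in linear attention (assumed here).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def attention_with_cache(Q, K, V):
    """Reference: each output re-reads every stored key and value (cache grows with t)."""
    d_v = len(V[0])
    outputs = []
    for t, q in enumerate(Q):
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(K[s]))) for s in range(t + 1)]
        total = sum(weights)
        outputs.append([sum(w * V[s][j] for s, w in enumerate(weights)) / total
                        for j in range(d_v)])
    return outputs

def attention_with_running_state(Q, K, V):
    """Same outputs from a fixed-size state: S is d_k x d_v, z has length d_k."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(Q, K, V):
        fk = feature_map(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = feature_map(q)
        denom = sum(fq[i] * z[i] for i in range(d_k))
        outputs.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                        for j in range(d_v)])
    return outputs

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d_k, d_v = 16, 4, 3
Q, K, V = (random_matrix(T, d_k, rng), random_matrix(T, d_k, rng),
           random_matrix(T, d_v, rng))
cached = attention_with_cache(Q, K, V)
running = attention_with_running_state(Q, K, V)
# Largest elementwise gap between the two computations (floating-point noise only).
max_diff = max(abs(a - b) for ra, rb in zip(cached, running) for a, b in zip(ra, rb))
```

The state here holds `d_k * d_v + d_k` numbers however long the sequence gets, whereas the cached version stores every key and value. The accuracy figures quoted above are the paper's own and are not reproduced by this toy.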

Carrying state between tokens

A second thread restores deliberation to the . Adding a tiny per-layer 'sticky note' to a model — about 330k new parameters trained in six hours on one GPU — yields a 15-point reasoning gain and reveals latent reasoning happening in distinct stable-then-reorganizing regimes E032. Reframing a as a problem makes training memory constant regardless of depth, and produces an emergent : the model learns to put its first guess at the fixed point, so the refinement loop vanishes at inference E041. The recurring finding is that uniform extra iteration can hurt — depth should be a per-question deliberation budget.
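The fixed-point behaviour described above can be sketched with a toy solver. Everything in it is an assumption for illustration: the update rule `tanh(W z + x)`, the small contractive matrix `W`, and the tolerance are hypothetical, and the constant-memory training trick from E041 is not shown. The sketch only covers the forward solve and how an initial guess that already sits at the fixed point makes the refinement loop exit immediately.

```python
import math

def step(z, x, W):
    # One refinement step: z <- tanh(W z + x).
    n = len(z)
    return [math.tanh(sum(W[i][j] * z[j] for j in range(n)) + x[i]) for i in range(n)]

def solve_fixed_point(x, W, z0, tol=1e-8, max_iters=100):
    """Iterate until one more step changes z by less than tol.

    Returns (z, n), where n counts the updates applied before the check passed.
    """
    z = list(z0)
    for n in range(max_iters):
        z_next = step(z, x, W)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, n
        z = z_next
    return z, max_iters

# Row sums of |W| are below 1 and tanh is 1-Lipschitz, so the map is a contraction.
W = [[0.2, -0.1],
     [0.1, 0.3]]
x = [0.5, -0.3]

# Cold start: the loop needs many refinement steps.
z_star, cold_iters = solve_fixed_point(x, W, z0=[0.0, 0.0])

# Warm start: a first guess already at the fixed point needs no refinement.
_, warm_iters = solve_fixed_point(x, W, z0=z_star)
```

Because the loop stops on a convergence check rather than a fixed depth, the number of steps acts as a per-input budget, which is the spirit of the "deliberation budget" remark above.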

Compute, not capacity

A reframe with real deployment value: a 's fixed-size state isn't a storage device but the residue of a one-pass computation, so shallow computation produces shallow residue regardless of capacity E085. Looping compute over context in a '' phase right before the cache clears lifts hard multi-operation problems from ~60% to ~90%, and crucially the extra compute is paid during ingestion, not at answer time, so inference latency is unchanged. The conceptual contribution — splitting inference into a compute-rich ingestion phase and a latency-constrained answer phase — is likely to outlast the specific mechanism.
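The ingestion/answer split can be illustrated with a deliberately simple toy. It is not the mechanism of E085: the two-weight state, the LMS update, the learning rate, and the synthetic context are all hypothetical. It shows only that extra passes over the context improve a fixed-size state while the answer step costs the same either way.

```python
import random

def ingest(context, passes, lr=0.1):
    """Compute-rich phase: loop over the context `passes` times, refining a fixed-size state."""
    w = [0.0, 0.0]  # fixed-size state; its size does not depend on len(context)
    for _ in range(passes):
        for x, y in context:
            err = y - (w[0] * x[0] + w[1] * x[1])
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
    return w

def answer(state, query):
    """Latency-constrained phase: one dot product, whatever the ingestion budget was."""
    return state[0] * query[0] + state[1] * query[1]

# Synthetic context generated by a hidden linear rule (illustrative only).
rng = random.Random(0)
true_w = (2.0, -1.0)
context = []
for _ in range(20):
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    context.append((x, true_w[0] * x[0] + true_w[1] * x[1]))

def state_error(w):
    # Distance between the ingested state and the rule that generated the context.
    return ((w[0] - true_w[0]) ** 2 + (w[1] - true_w[1]) ** 2) ** 0.5

shallow = ingest(context, passes=1)  # one-pass residue
deep = ingest(context, passes=8)     # same capacity, more ingestion compute
prediction = answer(deep, (1.0, 1.0))
```

Both states have identical capacity (two numbers); only the amount of computation spent during ingestion differs, and `answer` does the same work in either case.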

Pretraining without the brute force

On the pretraining side, an existence proof that the compute-to-performance ratio isn't a law of nature: a 1B model trained on 16 GPUs for about $1,500 matches , , and on reasoning, using fast/slow modules that keep deliberating through the final layer plus an objective that grades only response E074. The framing is honest: it is an existence proof rather than a new paradigm, with the data mixture not cleanly isolated and scaling beyond 1B unverified. Still, the point lands: architectural questions are accessible to small labs again.

Episodes anchoring this topic