Literature review · 6 episodes

Efficient architectures and long-context

Rethinking the KV cache

Two papers attack the KV cache from different angles. The first reframes associative recall as ridge regression that can be solved from running sufficient statistics, which turns the cache into an implementation choice: a fixed-size state thousands of times smaller than the cache scores 100% on benchmarks where pure state-space models score 3%, and its accuracy improves as sequences get longer E033. The second treats sparse attention as a geometric range-search problem rather than importing nearest-neighbor tricks from web search (a category error). The resulting method is faster than FlashAttention with over 99.9% recall, and it occasionally beats dense attention through a denoising effect E036. Both argue that the field imported the wrong toolkit and that the geometry of attention admits better algorithms.
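To make the "running sufficient statistics" idea concrete, here is a minimal NumPy sketch of ridge-regression recall from a fixed-size state. It is an illustration under stated assumptions, not the paper's implementation: the class name `RidgeRecallState`, the regularizer `lam`, and the toy dimensions are all hypothetical. The state consists of two accumulators whose size depends only on the key and value dimensions, never on the number of tokens seen.

```python
import numpy as np

class RidgeRecallState:
    """Fixed-size recall state (hypothetical name; illustrative only)."""

    def __init__(self, d_key, d_val, lam=1e-6):
        self.lam = lam                          # ridge regularizer (assumed value)
        self.gram = np.zeros((d_key, d_key))    # running sum of k k^T
        self.cross = np.zeros((d_key, d_val))   # running sum of k v^T

    def update(self, k, v):
        # Each token only adds to the sufficient statistics; nothing is cached.
        self.gram += np.outer(k, k)
        self.cross += np.outer(k, v)

    def read(self, q):
        # Solve (K^T K + lam I) W = K^T V, then predict the value for query q.
        d = self.gram.shape[0]
        W = np.linalg.solve(self.gram + self.lam * np.eye(d), self.cross)
        return q @ W

rng = np.random.default_rng(0)
d_key, d_val, n_tokens = 16, 4, 8
keys = rng.standard_normal((n_tokens, d_key))
vals = rng.standard_normal((n_tokens, d_val))

state = RidgeRecallState(d_key, d_val)
for k, v in zip(keys, vals):
    state.update(k, v)

# Query with a stored key and compare against its stored value.
recalled = state.read(keys[3])
err = float(np.max(np.abs(recalled - vals[3])))
```

In this toy setting there are fewer stored pairs than key dimensions, so recall is close to exact. How the approach behaves at realistic scale is a claim of the paper, not something this sketch demonstrates.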

Carrying state between tokens

A second thread restores deliberation to the transformer. Adding a tiny per-layer 'sticky note' to a frozen model (about 330k new parameters, trained in six hours on one GPU) yields a 15-point reasoning gain and reveals that latent reasoning proceeds in distinct stable-then-reorganizing regimes E032. Reframing a looped transformer as a fixed-point problem makes training memory constant regardless of depth and produces an emergent self-distillation: the model learns to place its first guess at the fixed point, so the refinement loop vanishes at inference E041. The recurring finding is that uniform extra iteration can hurt; depth should be a per-question deliberation budget.
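The fixed-point view can be sketched with a toy contraction map. The code below is an assumption-laden illustration, not the method from E041: the update `tanh(W z + x)`, the function name `solve_fixed_point`, and the tolerance are all hypothetical stand-ins for a looped layer. It shows two things from the paragraph above: the number of iterations is decided per input by a convergence test rather than fixed in advance, and a first guess that already sits at the fixed point makes the loop stop after a single step. The constant-memory training trick itself is not shown.

```python
import numpy as np

def solve_fixed_point(W, x, z0, tol=1e-8, max_iters=500):
    """Iterate z <- tanh(W z + x) until the update is smaller than tol.

    Returns the final state and the number of steps used.
    """
    z = z0
    for step in range(1, max_iters + 1):
        z_next = np.tanh(W @ z + x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iters

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)   # rescale to spectral norm 0.5 so the map contracts
x = rng.standard_normal(d)

# Cold start: the loop runs until the state stops changing.
z_star, cold_steps = solve_fixed_point(W, x, np.zeros(d))

# Warm start at the fixed point: the loop exits on its first check.
_, warm_steps = solve_fixed_point(W, x, z_star)

residual = float(np.linalg.norm(np.tanh(W @ z_star + x) - z_star))
```

The convergence test is what turns depth into a budget: inputs whose state settles quickly use few steps, and harder ones use more, up to `max_iters`.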

Compute, not capacity

A reframe with real deployment value: a hybrid model's fixed-size state is not a storage device but the residue of a one-pass computation, so shallow computation produces shallow residue regardless of capacity E085. Looping compute over the context in a 'sleep' phase right before the cache clears lifts hard multi-operation problems from ~60% to ~90%. Crucially, the extra compute is paid during ingestion rather than at answer time, so answer-time latency is unchanged. The conceptual contribution, splitting inference into a compute-rich ingestion phase and a latency-constrained answer phase, is likely to outlast the specific mechanism.

Pretraining without the brute force

On the pretraining side comes an existence proof that the compute-to-performance ratio is not a law of nature: a 1B model trained on 16 GPUs for about $1,500 matches Llama, Qwen, and Gemma on reasoning. It does so with fast/slow recurrent modules that keep deliberating through the final layer, plus an objective that grades only response tokens E074. The framing is honest: the result is presented as an existence proof rather than a new paradigm, the data mixture is not cleanly isolated, and scaling beyond 1B is unverified. Still, the point lands: architectural questions are accessible to small labs again.
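An objective that "grades only response tokens" is commonly implemented as a masked next-token loss. The sketch below assumes that reading; the function name `response_only_nll` and the toy shapes are hypothetical, and the paper's actual objective may differ in detail. Tokens where the mask is false (for example, the prompt) contribute nothing to the loss.

```python
import numpy as np

def response_only_nll(logits, targets, response_mask):
    """Mean negative log-likelihood over response tokens only.

    logits: (T, V) scores, targets: (T,) token ids,
    response_mask: (T,) booleans, True where the token is part of the response.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    mask = response_mask.astype(float)
    # Average over response tokens; guard against an all-False mask.
    return float((token_nll * mask).sum() / max(mask.sum(), 1.0))

# Toy sequence: 2 prompt tokens followed by 3 response tokens, vocabulary of 3.
logits = np.zeros((5, 3))
targets = np.array([0, 1, 2, 1, 0])
mask = np.array([False, False, True, True, True])
loss = response_only_nll(logits, targets, mask)

# Changing the prompt-position logits leaves the loss untouched.
perturbed = logits.copy()
perturbed[:2] = np.array([[9.0, 0.0, -9.0], [-9.0, 9.0, 0.0]])
loss_perturbed = response_only_nll(perturbed, targets, mask)
```

With uniform logits every response token costs log 3, so `loss` equals log 3 regardless of what the model predicts at prompt positions.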

Episodes anchoring this topic

Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
Reframed retrieval as regression, making the KV cache an implementation choice.
A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking
Added per-layer memory to a frozen model for a 15-point reasoning gain.
Sparse Attention Was the Wrong Frame. Treat It as Geometry Instead.
Recast sparse attention as geometric range search, beating FlashAttention with near-perfect recall.
Why Long-Context Models Might Need Compute, Not Capacity, Before Eviction
Showed that long-context performance is compute-bound, not capacity-bound, with a fix that adds no answer-time latency.
How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning
Matched Llama, Qwen, and Gemma on reasoning with a $1,500, 1B-parameter model via architecture and objective.
When the Iteration Teaches the Model to Skip the Iteration
Made looped reasoning trainable at constant memory with emergent self-distillation.