Literature review · 6 episode(s)

Efficient Architectures and Long-Context Systems


Reframing the KV cache and sparse attention

Treating retrieval as solvable from running produces a fixed-size state ~5,000× smaller than the equivalent while achieving 100% on canonical benchmarks — and accuracy that improves with longer sequences E033. From the opposite angle, reframing as a geometric range-search problem rather than approximate nearest-neighbour search yields a 15× speedup over PyTorch attention with over 99.9% recall, and on some reasoning benchmarks actually beats dense attention through a denoising effect E036. Compute-rich '' phases that loop computation over context before cache eviction lift accuracy with no increase in answer-time latency E085, reframing the bottleneck as compute-while-writing rather than memory-while-storing.
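To make the fixed-size-state idea concrete, here is a minimal sketch of one familiar construction: an outer-product associative memory, in which each key/value pair is folded into a running sum and a query is answered with a single matrix product. This is an illustrative assumption, not necessarily the method of E033. The names `RunningStateMemory` and `kv_cache_floats`, the toy dimensions, and the one-hot keys are all hypothetical.

```python
from typing import List

Vector = List[float]


class RunningStateMemory:
    """Fixed-size associative memory held as a d_k x d_v matrix of running sums."""

    def __init__(self, d_k: int, d_v: int) -> None:
        self.state: List[List[float]] = [[0.0] * d_v for _ in range(d_k)]

    def write(self, key: Vector, value: Vector) -> None:
        # Add the outer product of key and value; the matrix shape never changes.
        for i, k in enumerate(key):
            for j, v in enumerate(value):
                self.state[i][j] += k * v

    def read(self, query: Vector) -> Vector:
        # Retrieval is one vector-matrix product: query @ state.
        d_v = len(self.state[0])
        return [
            sum(q * self.state[i][j] for i, q in enumerate(query))
            for j in range(d_v)
        ]

    def num_floats(self) -> int:
        return len(self.state) * len(self.state[0])


def kv_cache_floats(seq_len: int, d_k: int, d_v: int) -> int:
    # A conventional cache stores every key and value, so it grows with seq_len.
    return seq_len * (d_k + d_v)


# One-hot (mutually orthogonal) keys make recall exact in this toy setting.
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
values = [[3.0, -1.0], [0.5, 2.0], [7.0, 4.0], [-2.0, 6.0]]

memory = RunningStateMemory(d_k=4, d_v=2)
for key, value in zip(keys, values):
    memory.write(key, value)

recalled = memory.read(keys[2])  # value stored under the third key
```

The state holds `d_k * d_v` numbers however many pairs are written, whereas the cache size from `kv_cache_floats` grows linearly with sequence length. Recall is exact here only because the keys are orthogonal; with correlated keys this simple memory returns a blend of stored values.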

The common thread is that inefficiency has been partly a coordinate choice. The performance ceilings that motivated giant models for retrieval are softer than they looked.

Small models, recurrence, and looped computation

Fixed-point attractor models match a 1.3B at 770M parameters and absorb the iterative refinement procedure into a single at inference — an emergent nobody designed in E041. A 1B-parameter model trained on $1,500 of compute reaches // reasoning territory using fast/slow modules, smarter normalisation placement, and signal concentrated on response E074. Adding a tiny per-layer 'sticky note' to a Gemma adds 15 points on and surfaces measurable latent-reasoning structure E032.
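The fixed-point idea can be illustrated with a toy solver that applies one shared update repeatedly until the state stops changing. This is a sketch under stated assumptions, not the architecture of E041: the update `tanh(W z + x)`, the weight matrix `W`, the tolerance, and the function names `step` and `solve_fixed_point` are hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def step(z: Vector, x: Vector, w: Matrix) -> Vector:
    # One application of the shared block: z_next = tanh(W z + x).
    return [
        math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + x_i)
        for row, x_i in zip(w, x)
    ]


def solve_fixed_point(
    x: Vector, w: Matrix, tol: float = 1e-6, max_iters: int = 200
) -> Tuple[Vector, int]:
    """Iterate from zero until the largest coordinate change drops below tol."""
    z = [0.0] * len(x)
    for n in range(1, max_iters + 1):
        z_next = step(z, x, w)
        delta = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if delta < tol:
            return z, n  # halted early: the state has settled
    return z, max_iters  # budget exhausted without settling


# Absolute row sums of W are below 1 and tanh is 1-Lipschitz, so the update
# is a contraction and the iteration converges to a unique fixed point.
W = [[0.2, -0.1], [0.1, 0.3]]
z_star, iters_used = solve_fixed_point([0.5, -0.3], W)
residual = max(abs(a - b) for a, b in zip(step(z_star, [0.5, -0.3], W), z_star))
```

The number of iterations is decided by the tolerance check rather than fixed in advance, so easy inputs settle in fewer steps than hard ones. The contraction condition on `W` is what guarantees convergence in this toy; a trained model would need its own stability argument.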

The takeaway is that the trillion-parameter race was solving a problem that smarter architectures and objectives can partially avoid, though existence proofs at sub-2B scale do not yet guarantee that these methods scale to the frontier.

Compute routing and per-query deliberation

Telling a model to 'think step by step' costs 50× the and often hurts; three statistics over the first 64 tokens (cumulative uncertainty, trend, smoothness) can route between and direct decoding without training a classifier, cutting average tokens by a third with no in accuracy E077. The same per-question logic applies inside attractor and state-stream architectures, where halting turn uniform iteration depth into a budgeted deliberation [E032, E041].
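A classifier-free router of this kind might look like the sketch below. It is an assumption-laden illustration rather than the procedure of E077: per-token entropy stands in for "uncertainty", the trend is a least-squares slope, smoothness is measured as the mean absolute step between neighbouring tokens, and the three thresholds are hypothetical placeholders that would need tuning on held-out data.

```python
import math
from typing import List, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    # Shannon entropy (in nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def routing_stats(entropies: Sequence[float], window: int = 64) -> Tuple[float, float, float]:
    """Return (cumulative uncertainty, trend, roughness) over the first `window` tokens."""
    h: List[float] = list(entropies[:window])
    n = len(h)
    if n == 0:
        raise ValueError("need at least one token entropy")
    cumulative = sum(h)
    # Trend: least-squares slope of entropy against token position.
    mean_t = (n - 1) / 2
    mean_h = cumulative / n
    denom = sum((t - mean_t) ** 2 for t in range(n))
    trend = (
        sum((t - mean_t) * (h[t] - mean_h) for t in range(n)) / denom
        if denom > 0.0 else 0.0
    )
    # Roughness: mean absolute change between neighbours (low means smooth).
    roughness = (
        sum(abs(h[t + 1] - h[t]) for t in range(n - 1)) / (n - 1)
        if n > 1 else 0.0
    )
    return cumulative, trend, roughness


def route(
    entropies: Sequence[float],
    max_cumulative: float = 32.0,  # hypothetical threshold
    max_trend: float = 0.01,       # hypothetical threshold
    max_roughness: float = 0.5,    # hypothetical threshold
) -> str:
    cumulative, trend, roughness = routing_stats(entropies)
    if cumulative > max_cumulative or trend > max_trend or roughness > max_roughness:
        return "deliberate"  # spend extra tokens on step-by-step reasoning
    return "direct"          # answer without extended reasoning


confident_prefix = [0.1] * 64                        # flat, low uncertainty
uncertain_prefix = [0.02 * t for t in range(64)]     # uncertainty keeps rising
decisions = (route(confident_prefix), route(uncertain_prefix))
```

Because the rule is a handful of comparisons on statistics the decoder already produces, it adds negligible cost per query and needs no training, only threshold selection.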

Production system papers are converging on a similar bet: sparse activation paired with verifiable-reward investment lets a model match and at one-tenth per- compute on workloads E090. Efficiency increasingly means deciding what to spend compute on, not just how much.

Episodes anchoring this topic