Literature review · 6 episodes

Efficient Architectures and Long-Context Systems

Reframing the KV cache and sparse attention

Treating retrieval as ridge regression solvable from running sufficient statistics produces a fixed-size state ~5,000× smaller than the equivalent KV cache while achieving 100% on canonical associative recall benchmarks, with accuracy that improves as sequences get longer E033. From the opposite angle, reframing top-k attention as a geometric range-search problem rather than approximate nearest-neighbour search yields a 15× speedup over PyTorch attention with over 99.9% recall, and on some reasoning benchmarks it beats dense attention outright through a denoising effect E036. Compute-rich 'sleep' phases that loop computation over context before cache eviction lift GSM-Infinite accuracy with no increase in answer-time latency E085, reframing the long-context bottleneck as compute-while-writing rather than memory-while-storing.
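The ridge-regression framing can be made concrete with a small sketch. This is a generic textbook construction, not the method from E033: the class name RidgeMemory, the regulariser lam, and the toy dimensions are all assumptions chosen for illustration. The point it shows is that the memory keeps only two running sums, so its size does not grow with the number of pairs written.

```python
import numpy as np

class RidgeMemory:
    """Hypothetical fixed-size associative memory (illustration only).

    Stores the sufficient statistics sum(k k^T) and sum(k v^T)
    instead of the key/value pairs themselves.
    """

    def __init__(self, d_key, d_val, lam=1e-3):
        self.A = np.zeros((d_key, d_key))  # running sum of k k^T
        self.B = np.zeros((d_key, d_val))  # running sum of k v^T
        self.lam = lam                     # ridge regulariser (assumed value)

    def write(self, k, v):
        # Each write updates the statistics in place; nothing is appended.
        self.A += np.outer(k, k)
        self.B += np.outer(k, v)

    def read(self, q):
        # Ridge solution W = (A + lam I)^{-1} B, then predict q^T W.
        eye = np.eye(self.A.shape[0])
        W = np.linalg.solve(self.A + self.lam * eye, self.B)
        return q @ W

rng = np.random.default_rng(0)
d_key, d_val, n_pairs = 64, 16, 32
keys = rng.standard_normal((n_pairs, d_key))
vals = rng.standard_normal((n_pairs, d_val))

mem = RidgeMemory(d_key, d_val)
for k, v in zip(keys, vals):
    mem.write(k, v)

# Query with the stored keys and compare against the stored values.
recalled = np.stack([mem.read(k) for k in keys])
max_err = float(np.abs(recalled - vals).max())
state_size = mem.A.size + mem.B.size  # independent of n_pairs
```

In this toy setting the state holds d_key × (d_key + d_val) numbers however many pairs are written, whereas a KV cache would hold n_pairs × (d_key + d_val). Recall is near-exact here only because there are fewer stored pairs than key dimensions; the sketch makes no claim about the benchmark behaviour reported in E033.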

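The common thread is that long-context inefficiency has been partly a matter of problem formulation: the same retrieval task gets much cheaper once it is posed as regression, range search, or write-time computation. The performance ceilings that motivated giant models for retrieval are softer than they looked.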

Small models, recurrence, and looped computation

Fixed-point attractor models match a 1.3B Transformer at 770M parameters and absorb the iterative refinement procedure into a single forward pass at inference, an emergent self-distillation that nobody designed E041. A 1B-parameter HRM-Text model trained on $1,500 of compute reaches Llama/Qwen/Gemma reasoning territory using fast/slow recurrent modules, smarter normalisation placement, and gradient signal concentrated on response tokens E074. Attaching a tiny per-layer 'sticky note' (a small persistent hidden state) to a frozen Gemma lifts GPQA-Diamond by 15 points and surfaces measurable latent-reasoning structure E032.
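For readers unfamiliar with fixed-point models, the iterative refinement being referred to can be sketched as repeated application of one layer until its output stops changing. This is a generic equilibrium-style iteration under stated assumptions, not the E041 architecture: the layer form tanh(Wz + Ux), the 0.5 spectral-norm scaling that forces convergence, and the names fixed_point and layer are all hypothetical.

```python
import numpy as np

def fixed_point(f, x, z0, tol=1e-8, max_iter=500):
    """Iterate z <- f(z, x) until successive iterates differ by less than tol."""
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
# Rescale to spectral norm 0.5 so the map is a contraction (assumption
# made so that convergence is guaranteed in this toy example).
W *= 0.5 / np.linalg.norm(W, 2)
U = rng.standard_normal((d, d))

def layer(z, x):
    # One refinement step: the same weights are reused at every iteration.
    return np.tanh(W @ z + U @ x)

x = rng.standard_normal(d)
z_star, steps = fixed_point(layer, x, np.zeros(d))

# At the attractor, applying the layer again leaves the state unchanged.
residual = float(np.linalg.norm(layer(z_star, x) - z_star))
```

The number of iterations needed (steps) is the cost that a single-forward-pass model would avoid; how E041 achieves that is not reproduced here.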

The takeaway is that the trillion-token pretraining race was solving a problem that smarter architecture and objectives can partially avoid, though the existence proofs at sub-2B scale don't yet guarantee these methods scale to frontier capability.

Compute routing and per-query deliberation

Telling a model to 'think step by step' costs 50× the tokens and often hurts; three statistics over the first 64 tokens (cumulative uncertainty, trend, smoothness) can route between chain-of-thought and direct decoding without training a classifier, cutting average tokens by a third with no loss in accuracy E077. The same per-question logic applies inside attractor and state-stream architectures, where halting probes turn a uniform iteration depth into a per-query deliberation budget E032, E041.
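A training-free router of this kind might look roughly like the sketch below. The source names the three statistics but not their definitions, so everything concrete here is an assumption: uncertainty is taken to be per-token predictive entropy, trend is a least-squares slope, smoothness is measured inversely as mean absolute step-to-step change (called roughness in the code), and the thresholds and decision rule are hypothetical placeholders rather than values from E077.

```python
import numpy as np

def routing_stats(entropies, window=64):
    """Summarise per-token uncertainty over the first `window` tokens."""
    h = np.asarray(entropies[:window], dtype=float)
    cumulative = float(h.sum())                             # total uncertainty
    trend = float(np.polyfit(np.arange(len(h)), h, 1)[0])   # slope of best-fit line
    roughness = float(np.abs(np.diff(h)).mean())            # low value = smooth
    return cumulative, trend, roughness

def route(entropies, cum_thresh=40.0, trend_thresh=0.0, rough_thresh=0.5):
    """Pick a decoding mode from the three statistics (thresholds are made up)."""
    cumulative, trend, roughness = routing_stats(entropies)
    # Assumed rule: spend chain-of-thought tokens only when the model is
    # uncertain overall and that uncertainty is rising or erratic.
    hard = cumulative > cum_thresh and (trend > trend_thresh or roughness > rough_thresh)
    return "chain_of_thought" if hard else "direct"

# Synthetic entropy traces standing in for real model outputs.
confident = np.full(64, 0.1)             # low, flat uncertainty
uncertain = np.linspace(0.5, 1.5, 64)    # high and rising uncertainty

easy_route = route(confident)
hard_route = route(uncertain)
```

Because the rule is a handful of comparisons on already-computed token probabilities, it adds no trained parameters, which is the property the paragraph above highlights.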

Production system papers are converging on a similar bet: sparse activation paired with verifiable-reward investment lets a model match Claude Opus and GPT-5 at one-tenth per-token compute on agentic workloads E090. Efficiency increasingly means deciding what to spend compute on, not just how much.

Episodes anchoring this topic

033-echo-kv-cache-free-associative-recall-with-spectral-koopman-
Reframed the KV cache as an implementation choice, with retrieval solvable from running sufficient statistics.
036-sparse-attention-as-a-range-searching-problem-towards-an-inf
Reframed sparse attention as geometric range search, beating FlashAttention with near-perfect recall.
032-state-stream-transformer-sst-v2-parallel-training-of-nonline
Added per-layer hidden state to frozen models and surfaced basin-shift dynamics in latent reasoning.
041-solve-the-loop-attractor-models-for-language-and-reasoning
Fixed-point attractor architecture that absorbs iterative refinement into a single forward pass at inference.
074-hrm-text-efficient-pretraining-beyond-scaling
Showed a 1B-parameter recurrent model trained on $1,500 of compute matching Llama/Qwen/Gemma on reasoning benchmarks.
085-language-models-need-sleep
Reframed long-context bottlenecks as compute-while-writing and proposed sleep-phase ingestion loops.