Literature review · 6 episodes

Efficiency, architecture, and long-context inference

On this page

Retrieval reframed: beyond the KV cache
Two papers argue the KV cache and approximate-nearest-neighbour sparse attention were both solving the wrong problem.
Looped and recurrent architectures
Reasoning gains can be bought by giving transformers state (between tokens, across iterations, or in a 'sleep' phase) without paying at inference time.
Compute placement: where the thinking happens
Where you spend extra compute matters as much as how much, and answer time isn't necessarily the right place.
Small models and architecture choice
Several results (agentic discovery, sparse/loop architectures, and trained skills) converge on the same conclusion: architecture and training signal are accessible to small labs again.

Retrieval reframed: beyond the KV cache

Echo reframes retrieval as ridge regression solvable from running sufficient statistics, which makes the KV cache an implementation choice rather than a necessity. It hits 100% on associative recall benchmarks with ~77KB of state, versus ~384MB per layer of KV cache at 131k tokens, and gets *more* accurate at longer sequences E033. The wall-clock speedup hasn't landed yet, but the memory argument is settled.
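To make the 'running sufficient statistics' idea concrete, here is a minimal pure-Python sketch of associative recall as ridge regression. It assumes the state is just the two running sums K^T K and K^T V; the class name `RidgeMemory`, the `lam` regulariser, and the toy solver are illustrative choices, not Echo's actual state layout or API.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


class RidgeMemory:
    """Hypothetical fixed-size memory: size depends on d_key and d_val only."""

    def __init__(self, d_key, d_val, lam=1e-6):
        self.lam = lam
        self.kk = [[0.0] * d_key for _ in range(d_key)]  # running sum of k k^T
        self.kv = [[0.0] * d_val for _ in range(d_key)]  # running sum of k v^T

    def write(self, k, v):
        # One token updates the sums in place; nothing per-token is stored.
        for i, ki in enumerate(k):
            for j, kj in enumerate(k):
                self.kk[i][j] += ki * kj
            for j, vj in enumerate(v):
                self.kv[i][j] += ki * vj

    def read(self, q):
        # v_hat = q^T (K^T K + lam I)^-1 K^T V
        n = len(q)
        a = [[self.kk[i][j] + (self.lam if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        w = solve(a, q)
        return [sum(w[i] * self.kv[i][j] for i in range(n))
                for j in range(len(self.kv[0]))]


mem = RidgeMemory(d_key=3, d_val=2)
pairs = [([1.0, 0.0, 0.0], [1.0, 0.0]),
         ([1.0, 1.0, 0.0], [0.0, 1.0]),
         ([1.0, 1.0, 1.0], [2.0, 2.0])]
for k, v in pairs:
    mem.write(k, v)
recalled = mem.read([1.0, 1.0, 0.0])  # query with the second key
```

The keys here are deliberately non-orthogonal: the regression solve still recovers the second value almost exactly, and the state stays at `d_key * d_key + d_key * d_val` numbers however many pairs are written.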

Louver makes the same move for sparse attention: approximate-nearest-neighbour (ANN) methods are a category error for top-k attention retrieval because the geometry doesn't fit. Reframing the problem as halfspace range searching produces a 15x speedup over PyTorch with >99.9% recall (versus 60-93% for ANN baselines). On MATH-500 the 'approximation' actually beats dense attention, with a denoising hypothesis for why E036. Both papers point at the same broader claim: a lot of efficiency engineering imported the wrong toolkit from web search.
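A small brute-force illustration of the geometric mismatch, under the assumption that 'top-k attention retrieval' means the k keys with the largest dot product against the query. The set of keys with `dot(q, key) >= tau` is a halfspace, and it need not overlap with the Euclidean nearest neighbours at all. The function names and toy vectors are made up for this sketch; it says nothing about how Louver's index is actually built.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def topk_by_score(q, keys, k):
    """Indices of the k keys with the largest attention score q . key."""
    order = sorted(range(len(keys)), key=lambda i: dot(q, keys[i]), reverse=True)
    return order[:k]


def halfspace_query(q, keys, tau):
    """Indices of all keys lying in the halfspace {x : q . x >= tau}."""
    return [i for i, key in enumerate(keys) if dot(q, key) >= tau]


def nearest_neighbours(q, keys, k):
    """Indices of the k keys closest to q in Euclidean distance."""
    return sorted(range(len(keys)), key=lambda i: sqdist(q, keys[i]))[:k]


q = [1.0, 0.0]
keys = [[0.9, 0.0], [5.0, 3.0], [1.0, 0.2], [4.0, -4.0]]

top2 = topk_by_score(q, keys, 2)
tau = dot(q, keys[top2[-1]])           # score of the k-th best key
in_halfspace = halfspace_query(q, keys, tau)
nn2 = nearest_neighbours(q, keys, 2)
```

In this toy set the two highest-scoring keys are the long vectors far from the query, while the two nearest neighbours are the short vectors next to it, so a perfect nearest-neighbour index would return exactly the wrong keys.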

Looped and recurrent architectures

A per-layer 'sticky note' that survives between tokens, implemented as a nonlinear cross-position recurrence trained with a two-pass approximation, adds 15 points on GPQA-Diamond to a frozen Gemma for $1k of compute on a single GPU E032. Attractor models reframe looped transformers as fixed-point problems with constant training memory regardless of iteration depth, and match a 1.3B Transformer at 770M parameters. They also show emergent self-distillation: the trained backbone learns to put its first guess at the fixed point, making the refinement module obsolete at inference E041.
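The fixed-point framing can be sketched with a scalar toy: iterate a contractive 'layer' until the state stops changing, then observe that a first guess already at the fixed point needs no real refinement. The layer `0.5 * tanh(z) + x`, the tolerance, and the function names are assumptions for illustration. This shows only the forward solve, not the constant-memory training procedure.

```python
import math


def solve_fixed_point(f, z0, tol=1e-10, max_iter=100):
    """Iterate z <- f(z) until successive states differ by less than tol."""
    z = z0
    for steps in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, steps
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


def make_layer(x):
    """Toy looped layer for input x; a contraction because |d/dz| <= 0.5."""
    return lambda z: 0.5 * math.tanh(z) + x


layer = make_layer(x=0.3)

# Cold start: many loop iterations before the state settles.
z_star, cold_steps = solve_fixed_point(layer, z0=0.0)

# Warm start at the fixed point: a single check confirms convergence.
_, warm_steps = solve_fixed_point(layer, z0=z_star)
```

The warm start is the toy analogue of the self-distillation result: if the first guess already lands on the fixed point, the loop exits after one evaluation.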

HRM-Text combines fast/slow recurrent modules, a PostNorm-like backward / PreNorm-like forward normalisation trick, and prompt-only attention to match Llama/Qwen/Gemma reasoning at 1B parameters trained on 16 GPUs for $1,500 E074. The shared message: the trillion-token race was solving with scale what architecture and training-signal choices could have solved more cheaply.

Compute placement: where the thinking happens

Hybrid-model performance turns out to be limited by computation depth during context ingestion, not by storage capacity. Adding a 'sleep' phase that loops compute over context right before the KV cache gets cleared — at zero cost to answer-time latency — lifts two-operation GSM-Infinite problems from ~60% to ~90% E085. The Rule 110 cellular-automaton experiment isolates this cleanly: when stored information is held constant and required computation varies, extra loops are what move the needle.
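For readers unfamiliar with Rule 110, a minimal sketch of the automaton follows. The stored information is the initial row, which stays the same size, while the required computation is the number of update steps requested. The periodic boundary and the function names are choices made for this sketch, and how the paper encodes the task for a language model is not shown.

```python
RULE = 110  # binary 01101110: bit n gives the output for neighbourhood n


def step(cells):
    """Apply one Rule 110 update to a row of 0/1 cells (periodic boundary)."""
    n = len(cells)
    return [
        (RULE >> (cells[(i - 1) % n] << 2 | cells[i] << 1 | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def run(cells, loops):
    """Same stored row, variable amount of computation."""
    for _ in range(loops):
        cells = step(cells)
    return cells


start = [0, 0, 0, 0, 1]
after_one = run(start, 1)    # [0, 0, 0, 1, 1]
after_three = run(start, 3)  # [0, 1, 1, 0, 1]
```

Answering 'what does the row look like after t steps' cannot be read off the stored row directly; each additional step requires another pass of computation over the same state.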

On the other side, much chain-of-thought is decorative. Telling a model to 'think step by step' often hurts accuracy at 50x the token cost — and whether reasoning helps is a property of the *model-query pair*, not the task. Reading three statistics off the entropy trajectory of the first 64 tokens lets you route queries between CoT and direct decoding, cutting token costs 30-50% with no accuracy loss E077.
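A rough sketch of what such a router could look like. Everything specific here is a placeholder: the summary does not say which three statistics are used, so mean, slope, and peak are stand-ins, and the weights, threshold, and the direction of the decision (high entropy routes to CoT) are hypothetical rather than taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def trajectory_stats(entropies, window=64):
    """Three placeholder statistics over the first `window` token entropies.

    Assumes at least two entropies are available.
    """
    h = entropies[:window]
    half = len(h) // 2
    mean = sum(h) / len(h)
    # Slope proxy: mean of the second half minus mean of the first half.
    slope = sum(h[half:]) / (len(h) - half) - sum(h[:half]) / half
    peak = max(h)
    return mean, slope, peak


def route(entropies, weights=(1.0, 0.0, 0.0), threshold=1.0):
    """Hypothetical linear router: 'cot' if the weighted score exceeds threshold."""
    score = sum(w * s for w, s in zip(weights, trajectory_stats(entropies)))
    return "cot" if score > threshold else "direct"


confident = [0.1] * 64   # model is sure of itself from the start
uncertain = [2.0] * 64   # model stays uncertain throughout
decisions = (route(confident), route(uncertain))
```

In practice the weights and threshold would have to be fitted per model, which fits the point that the value of reasoning is a property of the model-query pair.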

Small models and architecture choice

This year was a Three Mile Island for the 'bigger is better' default: an HRM-Text 1B matches frontier general-purpose models for $1,500 E074; an attractor model matches a 1.3B Transformer at 770M E041; a 30B open recursive agent matches Sonnet 4 and o3 on Oolong-Real E028; AIRA-driven agentic NAS finds models that scale 54% faster than Llama 3.2 E053. None of these claims survives without asterisks, but the composite signal, that architectural questions are once again interesting to scrappy labs, is real, and it inverts the dominant story of the past three years.

Episodes anchoring this topic

033-echo-kv-cache-free-associative-recall-with-spectral-koopman-
Reframed retrieval as regression solvable from sufficient statistics, shrinking state by 5000x at long context.
036-sparse-attention-as-a-range-searching-problem-towards-an-inf
Argued sparse attention should be geometric range searching, not approximate nearest neighbour.
032-state-stream-transformer-sst-v2-parallel-training-of-nonline
Showed a tiny per-layer recurrent state adds 15 points to a frozen Gemma on PhD-level science.
085-language-models-need-sleep
Reframed the long-context bottleneck as ingestion-time compute rather than memory capacity.
074-hrm-text-efficient-pretraining-beyond-scaling
Matched frontier reasoning at 1B parameters on $1,500 of compute via architecture and loss choices.
041-solve-the-loop-attractor-models-for-language-and-reasoning
Demonstrated emergent self-distillation: training with iteration produces a model that no longer needs it.