Literature review · 6 episodes

Efficiency, architecture, and long-context inference

Retrieval reframed: beyond the KV cache

reframes retrieval as solvable from running (making the KV cache an implementation choice rather than a necessity) and hits 100% on benchmarks with ~77KB of state versus ~384MB per layer of KV cache at 131k tokens, getting *more* accurate at longer sequences E033. The wall-clock speedup hasn't landed yet, but the memory argument is settled.
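
The ~384MB figure is consistent with simple arithmetic under assumptions the summary does not state: a key/value width of 768 per token per layer, 16-bit storage, and 131k read as 131,072 tokens. A minimal sketch, with an illustrative function name and parameters:

```python
def kv_cache_bytes_per_layer(seq_len: int, kv_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one layer.

    kv_dim is the total key (or value) width per token,
    i.e. num_kv_heads * head_dim.
    """
    # Factor 2: one key vector and one value vector are stored per token.
    return 2 * seq_len * kv_dim * bytes_per_elem

KIB = 1024
MIB = 1024 ** 2

# Assumed configuration (not from the source): 131,072 tokens, kv_dim = 768, fp16.
cache = kv_cache_bytes_per_layer(seq_len=131_072, kv_dim=768, bytes_per_elem=2)
fixed_state = 77 * KIB  # the ~77KB state quoted above, read as KiB

print(cache // MIB)                # 384
print(round(cache / fixed_state))  # 5107
```

The cache term grows linearly with `seq_len`, while a fixed-size running state does not, so under these assumptions the gap widens with every additional token of context.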

does the same for sparse attention: ANN methods are a category error for attention retrieval because the geometry doesn't fit. Reframing as produces a 15x speedup over PyTorch with >99.9% recall (versus 60-93% for ANN baselines), and on , the 'approximation' actually beats dense attention, with a denoising hypothesis for why E036. Both papers point at the same broader claim: a lot of efficiency engineering imported the wrong toolkit from web search.
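
Recall in this setting is naturally read as the fraction of the truly highest-scoring keys that a sparse method actually retrieves. The toy sketch below measures that, assuming dot-product scores and a hand-built candidate set standing in for whatever an index would return. None of the names come from the paper.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, keys, k):
    """Indices of the k keys with the largest dot product against the query."""
    order = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return set(order[:k])

def recall_at_k(retrieved, query, keys, k):
    """Fraction of the exact top-k keys present in the retrieved index set."""
    truth = exact_top_k(query, keys, k)
    return len(truth & set(retrieved)) / k

rng = random.Random(0)
dim, n_keys, k = 16, 256, 8
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
query = [rng.gauss(0, 1) for _ in range(dim)]

truth = sorted(exact_top_k(query, keys, k))
perfect = truth            # retrieves every true top-k key
lossy = truth[: k // 2]    # hypothetical index that misses half of them

print(recall_at_k(perfect, query, keys, k))  # 1.0
print(recall_at_k(lossy, query, keys, k))    # 0.5
```

Because softmax is monotonic in the score, ranking keys by dot product is the same as ranking them by attention weight, so a missed top-k key here is a missed high-weight key in attention.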

Looped and recurrent architectures

A per-layer 'sticky note' that survives between — implemented as a nonlinear cross-position trained with a two-pass approximation — adds 15 points on to a frozen for $1k of compute on a single GPU E032. Attractor models reframe looped as problems with constant training memory regardless of iteration depth, match a 1.3B Transformer at 770M parameters, and show emergent : the trained learns to put its first guess at the fixed point, making the refinement module obsolete at inference E041.
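
The 'first guess at the fixed point' observation can be illustrated with a toy contraction map. This is a scalar sketch under stated assumptions, not the attractor-model architecture: `refine` is a made-up update rule whose fixed point is 2.0, and `solve` simply iterates it until the update stalls.

```python
def refine(x: float) -> float:
    """Toy refinement step: a contraction whose fixed point is x = 2.0."""
    return 0.5 * x + 1.0

def solve(x0: float, tol: float = 1e-6, max_iters: int = 100):
    """Iterate x <- refine(x) until one step changes x by less than tol.

    Returns the final value and the number of refinement steps that
    actually moved x by at least tol.
    """
    x, steps = x0, 0
    for _ in range(max_iters):
        nxt = refine(x)
        if abs(nxt - x) < tol:
            return nxt, steps
        x, steps = nxt, steps + 1
    return x, steps

cold_value, cold_steps = solve(x0=0.0)  # poor initial guess: many steps needed
warm_value, warm_steps = solve(x0=2.0)  # guess already at the fixed point

print(cold_steps, warm_steps)
```

When the initial guess already sits at the fixed point, the loop exits with zero useful refinement steps, which is the toy analogue of the refinement module becoming unnecessary at inference.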

combines fast/slow modules, a PostNorm-like backward / PreNorm-like forward normalisation trick, and prompt-only to match // reasoning at 1B parameters trained on 16 GPUs for $1,500 E074. The shared message: the trillion-parameter race was solving with scale what architecture and training-signal choices could have solved more cheaply.

Compute placement: where the thinking happens

Hybrid-model performance turns out to be limited by computation depth during context ingestion, not by storage capacity. Adding a 'sleep' phase that loops compute over context right before the gets cleared — at zero cost to answer-time latency — lifts two-operation problems from ~60% to ~90% E085. The cellular-automaton experiment isolates this cleanly: when stored information is held constant and required computation varies, extra loops are what move the needle.
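
One way to picture the cellular-automaton setup is an elementary automaton in which the stored state is always the same few cells and only the number of sequential updates varies. The source does not say which automaton was used, so Rule 110 and the 8-cell ring below are illustrative assumptions.

```python
def step(cells: list[int], rule: int = 110) -> list[int]:
    """One synchronous update of an elementary cellular automaton (wraparound)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, mid, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        pattern = (left << 2) | (mid << 1) | right  # neighbourhood as a 3-bit number
        out.append((rule >> pattern) & 1)           # look up that bit of the rule
    return out

def evolve(cells: list[int], steps: int, rule: int = 110) -> list[int]:
    """Apply `steps` sequential updates; each one depends on the previous result."""
    for _ in range(steps):
        cells = step(cells, rule)
    return cells

# The stored information: 8 bits, held constant across all queries.
initial = [0, 0, 0, 0, 0, 0, 0, 1]

# Only the required computation depth changes.
for depth in (1, 2, 4):
    print(depth, evolve(initial, depth))
```

In this toy, answering a depth-4 query takes four passes over the same 8 cells. Storing more bits does not shorten that chain, which is the sense in which compute depth rather than capacity is the limiting factor.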

On the other side, much chain-of-thought (CoT) reasoning is decorative. Telling a model to 'think step by step' often hurts accuracy at 50x the cost, and whether reasoning helps is a property of the *model-query pair*, not the task. Reading three statistics off the of the first 64 tokens lets you route queries between CoT and direct decoding, cutting token costs 30-50% with no accuracy loss E077.

Small models and architecture choice

Three Mile Island for the 'bigger is better' default this year: an 1B matches frontier general-purpose models for $1,500 E074; an attractor model matches a 1.3B Transformer at 770M parameters E041; a 30B open recursive matches 4 and on E028; AIRA-driven agentic finds models that scale 54% faster than 3.2 E053. None of these claims survives without asterisks, but the composite signal (that architectural questions are once again interesting to scrappy labs) is real, and it inverts the dominant story of the past three years.

Episodes anchoring this topic