Literature review · 6 episodes

Efficient architectures and serving

Rethinking attention and the KV cache

Standard ANN-based sparse attention quietly assumes that dropped tokens don't matter, but reframing top-k as halfspace range searching — bounding balls plus subspace decomposition — gives 15x speedups over PyTorch attention, beats FlashAttention at long context, and surprisingly beats dense attention on some reasoning benchmarks via implicit denoising E036. The more radical move is to drop the KV cache entirely: a closed-form ridge-regression view of retrieval with a Koopman spectral filter compresses 131k tokens of state into ~77KB and gets more accurate with longer sequences, inverting the SSM memory cliff E033. Together they suggest the KV cache was an implementation choice the field forgot was optional.
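The ridge-regression view of retrieval can be made concrete with a short worked equation. This is the textbook regularised least-squares associative memory, not necessarily the exact formulation used in E033. The symbols k_i, v_i, q, λ and W are assumed notation, and the Koopman spectral filter is not modelled here.

```latex
% Assumed notation: keys k_i in R^{d_k}, values v_i in R^{d_v}, i = 1..T,
% ridge penalty \lambda > 0, query q in R^{d_k}.
\[
\begin{aligned}
W^\star &= \arg\min_{W} \sum_{i=1}^{T} \lVert W k_i - v_i \rVert_2^2
           + \lambda \lVert W \rVert_F^2 \\
        &= \Big( \sum_{i=1}^{T} v_i k_i^\top \Big)
           \Big( \sum_{i=1}^{T} k_i k_i^\top + \lambda I \Big)^{-1}, \\
\hat{v}(q) &= W^\star q .
\end{aligned}
\]
```

The two running sums have shapes d_v × d_k and d_k × d_k, so in this sketch the stored state does not grow with the sequence length T. That is one way a fixed-size summary could stand in for a per-token cache.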

Latent recurrence and per-question thinking budgets

A per-layer sticky note added to a frozen Gemma adds 15 points on GPQA, with a halting probe at position zero that lets the model spend more compute only on hard questions E032. Attractor models treat looped Transformers as fixed-point problems and produce equilibrium internalisation: the trained backbone learns to put its first guess at the fixed point and the refinement module becomes obsolete at inference E041. HRM-Text formalises a similar fast/slow recurrence and reaches Llama/Gemma quality at 1B parameters on a $1,500 training run E074. The corpus has also surfaced a clean entropy-based router: read the shape of the model's uncertainty curve over the first 64 tokens to decide whether chain-of-thought is even worth invoking, cutting token costs by a third with no accuracy loss E077.
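The entropy-based routing idea can be illustrated with a minimal sketch. It assumes the next-token probability distributions for the first tokens are available as plain lists. The decision rule (mean entropy plus a least-squares slope) and both thresholds are hypothetical stand-ins for "the shape of the uncertainty curve", not the rule from E077.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def slope(ys: Sequence[float]) -> float:
    """Least-squares slope of ys against positions 0..n-1."""
    n = len(ys)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(ys))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def should_use_cot(
    step_probs: Sequence[Sequence[float]],
    window: int = 64,             # number of leading tokens inspected
    mean_threshold: float = 1.0,  # hypothetical threshold, in nats
    slope_threshold: float = 0.01,  # hypothetical threshold, nats per token
) -> bool:
    """Route to chain-of-thought when early uncertainty is high or rising."""
    curve = [token_entropy(p) for p in step_probs[:window]]
    if not curve:
        return False
    mean_h = sum(curve) / len(curve)
    return mean_h > mean_threshold or slope(curve) > slope_threshold


# Usage: a confident model answers directly, an uncertain one gets CoT.
confident = [[0.97, 0.01, 0.01, 0.01]] * 64
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 64
print(should_use_cot(confident), should_use_cot(uncertain))
```

In practice the thresholds would need calibrating on held-out questions. The values here are chosen only so the two toy inputs fall on opposite sides.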

Serving infrastructure for agents, not chat

Throughput dashboards lie for agent workloads: tool pauses cause KV cache thrashing, and CPU-GPU coupling strands GPU capacity. A multi-level feedback queue lifted from 1970s OS scheduling produces up to 5.94x mean latency reduction, with ~1.87x in a real OpenHands deployment E016. For tree-search-based agents and RL training, the bottleneck has been seconds-long sandbox checkpoint/rollback; OverlayFS + XFS reflinks plus a forked CRIU body double bring rollback to 5ms and mask the cost inside the LLM call you were waiting on anyway, pushing RL GPU utilisation from 51% to 99% E068. Computer-use agents specifically benefit from compiler-style planning: hedging over candidate plans, caching tools with state preconditions, and running parallel browser sessions to absorb heavy-tailed click latencies cut latency 10x E063. A more speculative bet says serving frameworks themselves should be bespoke, with AI agents writing per-workload runtimes that match or beat vLLM on long-tail workloads E027.
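For readers unfamiliar with the scheduling idea borrowed here, a classic multi-level feedback queue can be sketched in a few lines. This is the generic OS textbook version, not the scheduler from E016. The `Job` and `MLFQ` names, the abstract work units, and the quanta of 1, 2 and 4 are all assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Sequence


@dataclass
class Job:
    name: str
    remaining: int  # abstract work units left (hypothetical unit)
    level: int = 0  # current priority level; 0 is highest


class MLFQ:
    """Textbook multi-level feedback queue with growing time quanta."""

    def __init__(self, quanta: Sequence[int] = (1, 2, 4)) -> None:
        self.quanta = tuple(quanta)
        self.queues: List[Deque[Job]] = [deque() for _ in self.quanta]

    def submit(self, job: Job) -> None:
        # New jobs always start at the highest priority.
        job.level = 0
        self.queues[0].append(job)

    def step(self) -> Optional[Job]:
        """Run the highest-priority waiting job for one quantum."""
        for level, queue in enumerate(self.queues):
            if queue:
                job = queue.popleft()
                job.remaining -= min(self.quanta[level], job.remaining)
                if job.remaining > 0:
                    # Used its whole quantum without finishing: demote.
                    job.level = min(level + 1, len(self.queues) - 1)
                    self.queues[job.level].append(job)
                return job
        return None


def completion_order(jobs: Sequence[Job]) -> List[str]:
    """Submit all jobs, run to completion, return names in finish order."""
    sched = MLFQ()
    for job in jobs:
        sched.submit(job)
    finished: List[str] = []
    while (job := sched.step()) is not None:
        if job.remaining == 0:
            finished.append(job.name)
    return finished


# Usage: the short job finishes first even though it arrived second.
print(completion_order([Job("long", 6), Job("short", 1)]))
```

Short requests finish at the top level while long-running ones drift down to larger quanta. This keeps mean latency low without the scheduler knowing job lengths in advance.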

Episodes anchoring this topic

036-sparse-attention-as-a-range-searching-problem-towards-an-inf
Reframed sparse attention as range search rather than ANN, beating FlashAttention with provable recall.
033-echo-kv-cache-free-associative-recall-with-spectral-koopman-
Replaced the KV cache with closed-form ridge regression and a spectral filter.
032-state-stream-transformer-sst-v2-parallel-training-of-nonline
Added per-layer latent memory to a frozen model with a halting probe for adaptive depth.
068-deltabox-scaling-stateful-ai-agents-with-millisecond-level-s
Brought sandbox checkpoint/rollback to millisecond scale via OS-level tricks.
016-mars-efficient-adaptive-co-scheduling-for-heterogeneous-agen
Imported classical OS scheduling ideas into LLM serving for agent workloads.
074-hrm-text-efficient-pretraining-beyond-scaling
Existence proof that 1B-parameter reasoning models can be trained for $1,500.