Literature review · 6 episodes

Efficient architectures and serving


Rethinking attention and the KV cache

Standard -based sparse quietly assumes that dropped don't matter, but reframing as (bounding balls plus subspace decomposition) gives 15x speedups over PyTorch attention, beats at long context, and, surprisingly, beats dense attention on some reasoning benchmarks via implicit denoising E036. The more radical move is to drop the KV cache entirely: a closed-form ridge-regression view of retrieval with a spectral filter compresses 131k tokens of state into ~77KB and gets more accurate with longer sequences, inverting the E033. Together they suggest the KV cache was an implementation choice the field forgot was optional.
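The fixed-size-state idea can be pictured with a minimal sketch, assuming plain ridge regression from keys to values. It omits the spectral filter, whose details are not given here, and it is not the method from E033. The names `RidgeMemory`, `write`, `read`, `demo` and the regulariser `lam` are hypothetical. The only thing it relies on is the textbook closed form W = (K^T K + lam I)^{-1} K^T V, whose sufficient statistics have a size set by the key and value dimensions rather than by the number of tokens.

```python
import numpy as np


class RidgeMemory:
    """Toy key-value memory with a fixed-size state (hypothetical names)."""

    def __init__(self, d_key, d_val, lam=1e-3):
        self.lam = lam
        self.gram = np.zeros((d_key, d_key))   # running sum of k k^T
        self.cross = np.zeros((d_key, d_val))  # running sum of k v^T

    def write(self, k, v):
        # Each token updates the two accumulators; nothing per-token is stored.
        self.gram += np.outer(k, k)
        self.cross += np.outer(k, v)

    def read(self, q):
        # Solve (K^T K + lam I) W = K^T V, then return q^T W.
        d = self.gram.shape[0]
        W = np.linalg.solve(self.gram + self.lam * np.eye(d), self.cross)
        return q @ W

    def state_bytes(self):
        return self.gram.nbytes + self.cross.nbytes


def demo(n_tokens, d_key=32, d_val=8, seed=0):
    """Write n_tokens synthetic pairs, then report state size and read error."""
    rng = np.random.default_rng(seed)
    mem = RidgeMemory(d_key, d_val)
    W_true = rng.normal(size=(d_key, d_val))
    for _ in range(n_tokens):
        k = rng.normal(size=d_key)
        mem.write(k, k @ W_true)  # assumption: values are linear in the keys
    q = rng.normal(size=d_key)
    err = np.linalg.norm(mem.read(q) - q @ W_true)
    return mem.state_bytes(), err
```

Calling `demo(128)` and `demo(4096)` returns the same state size in both cases, because only the two accumulators are kept. In this toy setting, where values are an exact linear function of the keys, extra tokens add information to the estimate without growing the state.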

Latent recurrence and per-question thinking budgets

A per-layer sticky note added to a frozen gives 15 points on , with a at position zero that lets the model spend more compute only on hard questions E032. Attractor models treat looped Transformers as problems and produce equilibrium internalisation: the trained learns to put its first guess at the fixed point and the refinement module becomes obsolete at inference E041. formalises a similar fast/slow and reaches /Gemma quality at 1B parameters on a $1,500 training run E074. The corpus has also surfaced a clean -based router: read the shape of the model's uncertainty curve over the first 64 to decide whether is even worth invoking, cutting token costs by a third with no accuracy loss E077.
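One way to picture the router described for E077 is as a cheap check on per-token uncertainty. The sketch below is an illustration under several assumptions, not the published method: it assumes the "first 64" refers to generated tokens, uses Shannon entropy as the uncertainty measure, and invents the function names, the slope feature, and both thresholds.

```python
import numpy as np


def token_entropies(probs):
    """Shannon entropy (nats) of each row of a (steps, vocab) probability array."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def should_think(probs, window=64, mean_thresh=1.0, slope_thresh=0.01):
    """Return True if the slower reasoning path looks worth invoking.

    Hypothetical rule: route to the slow path when average entropy over the
    window is high, or when entropy is still rising across the window.
    Both thresholds are placeholders and would need tuning on real data.
    """
    h = token_entropies(probs)[:window]
    mean_h = h.mean()
    # Least-squares slope of the entropy curve over token position.
    slope = np.polyfit(np.arange(len(h)), h, 1)[0] if len(h) > 1 else 0.0
    return bool(mean_h > mean_thresh or slope > slope_thresh)
```

With a model that puts 97% of its mass on one token at every step, the mean entropy stays far below the threshold and the function returns False. With a uniform distribution over four tokens, the entropy is ln 4 and it returns True.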

Serving infrastructure for agents, not chat

Throughput dashboards lie for workloads — tool pauses cause thrashing and -GPU coupling strands GPU capacity. A lifted from 1970s OS scheduling produces up to 5.94x mean latency reduction, with ~1.87x in a real deployment E016. For tree-search-based agents and RL training, the bottleneck has been seconds-long /rollback; + plus a forked body double brings rollback to 5ms and masks the cost inside the LLM call you were waiting on anyway, pushing RL GPU utilisation from 51% to 99% E068. And computer-use agents specifically benefit from compiler-style planning: over candidate plans, cache tools with state , and parallel browser sessions for click latencies cut latency 10x E063. A more speculative bet says serving frameworks themselves should be bespoke — AI agents writing per-workload runtimes that match or beat on long-tail workloads E027.

Episodes anchoring this topic