Literature review · 6 episode(s)

Efficient architectures and inference


Rethinking attention and the cache

Two episodes attack inference at its foundations. Reframing retrieval as a regression problem solvable from running statistics makes the an implementation choice, not a necessity — scores 100% on with a fixed state about five thousand times smaller, and gets more accurate with longer sequences E033. And the whole sparse- literature imported the wrong toolkit from web search: reframing as geometric range searching yields a method faster than with over 99.9% recall, that sometimes beats dense attention on accuracy E036.

The shared move is to question a load-bearing assumption — that dropped don't matter, or that the cache must grow linearly — that the field had stopped examining.
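To make the regression reframing concrete, here is a minimal sketch of an associative memory that answers lookups by ridge regression over running sums, so its state stays the same size however many pairs are written. It is an illustration under stated assumptions, not the method from E033: the class name `RunningRegressionMemory`, the `ridge` parameter, and the toy dimensions are all hypothetical.

```python
import numpy as np


class RunningRegressionMemory:
    """Fixed-size associative memory: ridge regression from running sums.

    Hypothetical illustration only; not the method described in E033.
    """

    def __init__(self, d_key: int, d_val: int, ridge: float = 1e-6):
        # Running second-moment statistics. Their size depends only on the
        # key and value dimensions, never on how many pairs were written.
        self.kk = ridge * np.eye(d_key)       # ridge term plus sum of k k^T
        self.kv = np.zeros((d_key, d_val))    # sum of k v^T

    def write(self, k: np.ndarray, v: np.ndarray) -> None:
        self.kk += np.outer(k, k)
        self.kv += np.outer(k, v)

    def read(self, q: np.ndarray) -> np.ndarray:
        # Closed-form ridge solution W = (K^T K + ridge I)^-1 K^T V,
        # then map the query through W.
        w = np.linalg.solve(self.kk, self.kv)  # shape (d_key, d_val)
        return q @ w


rng = np.random.default_rng(0)
d_key, d_val, n = 16, 4, 8                    # n <= d_key, so recall can be near exact
keys = rng.standard_normal((n, d_key))
vals = rng.standard_normal((n, d_val))

mem = RunningRegressionMemory(d_key, d_val)
for k, v in zip(keys, vals):
    mem.write(k, v)

# Query with each stored key and compare against the stored value.
recalled = np.stack([mem.read(k) for k in keys])
max_err = float(np.abs(recalled - vals).max())
```

In this toy setting the stored values come back almost exactly because there are fewer pairs than key dimensions. Once the number of pairs exceeds `d_key`, the same state returns a least-squares fit rather than an exact lookup.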

Recurrence and latent reasoning

A second strand gives a way to think between . A tiny per-layer 'sticky note' — about 330,000 parameters on a — produced a 15-point reasoning gain and revealed measurable structure to latent reasoning E032. Reframing a as a problem solved hard Sudoku at 27M parameters, and the iteration quietly disappeared at inference, absorbed into one — an emergent nobody designed in E041. And the bottleneck for reasoning may be compute during ingestion, not memory size, recoverable by a '' phase that loops compute before the cache clears at no answer-time cost E085.

These point at a shared question: should a model commit after one pass, or be able to deliberate — and where should that deliberation be paid for?

Efficient pretraining and architecture search

The trillion- race looks less inevitable than it did. A 1B model trained for $1,500 matched and on reasoning by keeping modules deliberating through the final layer and grading only response tokens — an existence proof that the compute-to-performance ratio isn't a law of nature E074. And architecture design is being handed to : eleven LLM agents explored 2,300 architectures and produced models beating Llama 3.2 at 1B, plus a training script that beat the human-tuned reference by importing from object detection — competent engineering recombination rather than new mathematics E053.

The upshot is that architectural questions, long dominated by the largest labs, are accessible to small teams again.
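One common reading of "grading only response tokens" is a loss mask: cross-entropy is averaged over response positions and prompt positions are skipped. The sketch below assumes that reading. The function name `response_only_loss` and the toy inputs are hypothetical, and nothing here is taken from the training recipe in E074.

```python
import math
from typing import Sequence


def response_only_loss(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    is_response: Sequence[bool],
) -> float:
    """Mean cross-entropy over positions flagged as response tokens."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, is_response):
        if not keep:
            continue  # prompt tokens contribute nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1
    if count == 0:
        raise ValueError("no response tokens to grade")
    return total / count


# Toy sequence of four positions over a three-token vocabulary.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]
targets = [1, 0, 2, 0]                 # the first two predictions are wrong
mask = [False, False, True, True]      # only the last two positions are graded

loss_response = response_only_loss(logits, targets, mask)
loss_all = response_only_loss(logits, targets, [True] * 4)
```

Here the masked loss is lower than the unmasked one only because the two wrong predictions sit in the masked-out prompt positions. The point of the mask is that errors on prompt tokens produce no gradient.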

Serving stacks for the agentic era

LLM serving is professionalizing into systems research as break stacks built for chat. Throughput dashboards lie for agent workloads, and borrowing a from 1970s operating systems cut mean latency up to 5.94x by treating sessions as processes and the as virtual memory E016. Tree search and RL for agents were blocked not by models but by seconds-long — hijacking and to a filesystem and a process gives 5-millisecond rollback, closing a 30-point gap and pushing RL GPU utilization from 51% to 99% E068.

The same forkable-execution idea recurs at the level: layering forks a 5.8-gigabyte agent world in about a seventh of a second, enabling debugging and E096. The vocabulary of operating systems is becoming the vocabulary of agent infrastructure.
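The layering idea can be illustrated in miniature with Python's standard `collections.ChainMap`: a fork is a new empty layer on top of shared base state, and rollback is discarding that layer. This is only an analogy under assumptions. The systems in E068 and E096 operate on filesystems and processes, and the keys in `world` below are invented for the example.

```python
from collections import ChainMap

# Base "world" state; imagine it is large and expensive to copy.
world = ChainMap({"cwd": "/repo", "tests_passed": 0, "open_files": 3})

# Forking adds an empty top layer. Nothing in the base is copied.
branch = world.new_child()
branch["tests_passed"] = 12       # the write lands in the branch's own layer
branch["scratch"] = "tmp.txt"     # new keys also stay in the branch

base_view = dict(world)           # flattened view of the untouched base
branch_view = dict(branch)        # base values overlaid with branch writes

# Rolling back is dropping the top layer.
rolled_back = branch.parents
```

The cost of a fork here is independent of the size of the base, which is the property the layered approach relies on. Unlike a true snapshot, later writes to the base remain visible through the branch, and mutable values are shared between layers, so a real system needs stronger isolation than this sketch provides.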

Episodes anchoring this topic