Literature review · 6 episodes

Efficient architectures and inference

Rethinking attention and the cache

Two episodes attack long-context inference at its foundations. Reframing retrieval as a regression problem solvable from running statistics makes the KV cache an implementation choice, not a necessity: Echo scores 100% on associative recall with a fixed state about five thousand times smaller, and it gets more accurate as sequences lengthen E033. The second argues that the sparse-attention literature imported the wrong toolkit from web search: reframing top-k retrieval as geometric range searching yields a method that is faster than FlashAttention, reaches over 99.9% recall, and sometimes beats dense attention on accuracy E036.
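
To make the regression framing concrete, here is a minimal sketch of associative recall from running statistics. It is generic ridge regression over accumulated key/value sums, not Echo's actual method; the class name RunningRecall, the ridge constant, and the toy dimensions are all assumptions made for illustration.

```python
import numpy as np


class RunningRecall:
    """Fixed-size associative memory: ridge regression from running sums.

    Illustrative only. The state is two matrices whose size depends on the
    key/value dimensions, never on how many pairs have been written.
    """

    def __init__(self, d_key, d_val, ridge=1e-6):
        self.kk = np.zeros((d_key, d_key))  # running sum of k k^T
        self.kv = np.zeros((d_key, d_val))  # running sum of k v^T
        self.ridge = ridge

    def write(self, k, v):
        # Fold one (key, value) pair into the statistics; nothing is cached.
        self.kk += np.outer(k, k)
        self.kv += np.outer(k, v)

    def read(self, q):
        # Solve the regularized normal equations, then apply the map to q.
        d = self.kk.shape[0]
        W = np.linalg.solve(self.kk + self.ridge * np.eye(d), self.kv)
        return q @ W


rng = np.random.default_rng(0)
d_key, d_val, n = 16, 4, 8
# Orthonormal keys make exact recall easy to check in this toy setup.
basis, _ = np.linalg.qr(rng.standard_normal((d_key, d_key)))
keys = basis[:n]
values = rng.standard_normal((n, d_val))

mem = RunningRecall(d_key, d_val)
for k, v in zip(keys, values):
    mem.write(k, v)

recalled = mem.read(keys[3])  # close to values[3]
```

In this toy, recall is exact up to the ridge term as long as the stored keys are linearly independent; past that point a plain least-squares memory becomes approximate.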

The shared move is to question a load-bearing assumption that the field had stopped examining: that the cache must grow linearly, or that dropped tokens don't matter.
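
One way to probe the "dropped tokens don't matter" assumption is to measure how much softmax mass an exact top-k selection keeps. The sketch below does that with brute-force scoring; it does not implement the range-searching method from E036, and the function names and random toy data are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def dense_attention(q, K, V):
    # Standard single-query scaled dot-product attention.
    return softmax(K @ q / np.sqrt(q.size)) @ V


def topk_attention(q, K, V, k):
    # Attend only to the k highest-scoring keys (exact, brute-force top-k).
    scores = K @ q / np.sqrt(q.size)
    keep = np.argpartition(scores, -k)[-k:]
    return softmax(scores[keep]) @ V[keep], keep


def kept_mass(q, K, k):
    # Fraction of the dense softmax weight that the top-k keys account for.
    weights = softmax(K @ q / np.sqrt(q.size))
    return float(np.sort(weights)[-k:].sum())


rng = np.random.default_rng(0)
n, d = 64, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

sparse_out, kept = topk_attention(q, K, V, k=8)
mass = kept_mass(q, K, k=8)  # how much attention weight survives the cut
```

When kept_mass is close to 1, dropping the remaining tokens changes the output very little; when it is not, the assumption is doing real work.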

Recurrence and latent reasoning

A second strand gives transformers a way to think between tokens. A tiny per-layer 'sticky note', about 330,000 parameters on a frozen Gemma, produced a 15-point reasoning gain and revealed measurable structure in latent reasoning E032. Reframing a looped transformer as a fixed-point problem solved hard Sudoku at 27M parameters, and the iteration quietly disappeared at inference, absorbed into one forward pass: an unplanned, emergent self-distillation E041. And the bottleneck for long-context reasoning may be compute during ingestion, not memory size, and the shortfall is recoverable by a 'sleep' phase that loops compute before the cache clears, at no answer-time cost E085.
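
The fixed-point view of a looped layer can be sketched as plain iteration to convergence. This is a toy under stated assumptions, not the E041 model: the layer is a single tanh map whose weight matrix is rescaled to spectral norm 0.5 so that the iteration is a contraction, and the names solve_fixed_point and layer are hypothetical.

```python
import numpy as np


def solve_fixed_point(step, x, z0, tol=1e-8, max_iters=500):
    """Iterate z <- step(z, x) until successive iterates differ by < tol."""
    z = z0
    for i in range(1, max_iters + 1):
        z_next = step(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iters


rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)  # spectral norm 0.5, so the map contracts
U = rng.standard_normal((d, d))


def layer(z, x):
    # One "loop" of the toy block: same weights applied at every iteration.
    return np.tanh(W @ z + U @ x)


x = rng.standard_normal(d)
z_star, iters = solve_fixed_point(layer, x, np.zeros(d))
# z_star satisfies z_star ~= layer(z_star, x): applying the layer again
# no longer changes the state.
```

Because tanh is 1-Lipschitz and the weight matrix has norm below 1, convergence here is guaranteed; a trained looped transformer carries no such guarantee by default.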

These point at a shared question: should a model commit after one pass, or be able to deliberate — and where should that deliberation be paid for?

Efficient pretraining and architecture search

The trillion-token race looks less inevitable than it did. A 1B model trained for $1,500 matched Llama and Gemma on reasoning by keeping recurrent modules deliberating through the final layer and grading only response tokens — an existence proof that the compute-to-performance ratio isn't a law of nature E074. And architecture design is being handed to agents: eleven LLM agents explored 2,300 architectures and produced models beating Llama 3.2 at 1B, plus a training script that beat the human-tuned reference by importing focal loss from object detection — competent engineering recombination rather than new mathematics E053.
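
For readers unfamiliar with the focal loss mentioned above, here is a minimal NumPy sketch of its standard form, FL = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the correct class. How the agents in E053 wired it into their training script is not described here; the function name, gamma value, and toy logits are assumptions.

```python
import numpy as np


def focal_loss(logits, targets, gamma=2.0):
    """Mean focal loss for integer class targets.

    With gamma=0 this reduces to ordinary cross-entropy; larger gamma
    down-weights examples the model already classifies confidently.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    log_pt = log_probs[np.arange(len(targets)), targets]  # log p of true class
    pt = np.exp(log_pt)
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))


# One easy example (confident and correct) and one hard example.
logits = np.array([[4.0, 0.0, 0.0], [0.2, 0.0, 0.1]])
targets = np.array([0, 0])

ce = focal_loss(logits, targets, gamma=0.0)  # plain cross-entropy
fl = focal_loss(logits, targets, gamma=2.0)  # easy example contributes less
```

The intended effect is that gradient signal concentrates on hard examples, which is why the loss was originally proposed for class-imbalanced object detection.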

The upshot is that architectural questions, long dominated by the largest labs, are accessible to small teams again.

Serving stacks for the agentic era

LLM serving is professionalizing into systems research as agents break stacks built for chat. Throughput dashboards lie for agent workloads, and borrowing a multi-level feedback queue from 1970s operating systems cut mean latency up to 5.94x by treating sessions as processes and the KV cache as virtual memory E016. Tree search and RL for agents were blocked not by models but by seconds-long sandbox checkpointing — hijacking OverlayFS and CRIU to fork a filesystem and freeze a process gives 5-millisecond rollback, closing a 30-point SWE-bench gap and pushing RL GPU utilization from 51% to 99% E068.
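
The multi-level feedback queue idea can be shown with a small simulation. This is a textbook-style sketch, not the scheduler from E016: it treats each session as a job with a known amount of remaining work, omits priority boosts and any KV-cache modeling, and the function name and quanta are hypothetical.

```python
from collections import deque


def mlfq_schedule(jobs, quanta=(1, 2, 4)):
    """Simulate a multi-level feedback queue.

    jobs: dict mapping job name -> work units, all arriving at time 0.
    Returns a dict mapping job name -> completion time.
    """
    queues = [deque() for _ in quanta]
    for name, work in jobs.items():
        queues[0].append((name, work))  # every job starts at top priority
    clock = 0
    finished = {}
    while any(queues):
        # Always serve the highest-priority non-empty queue.
        level = next(i for i, q in enumerate(queues) if q)
        name, remaining = queues[level].popleft()
        run = min(quanta[level], remaining)
        clock += run
        remaining -= run
        if remaining == 0:
            finished[name] = clock
        else:
            # The job used its whole quantum, so demote it one level.
            lower = min(level + 1, len(queues) - 1)
            queues[lower].append((name, remaining))
    return finished


finish_times = mlfq_schedule({"long": 6, "short_a": 1, "short_b": 1})
mean_latency = sum(finish_times.values()) / len(finish_times)
# Run first-come-first-served, the same jobs would finish at 6, 7 and 8.
```

Short sessions finish early because the long one is demoted after its first quantum, which is the mechanism behind the mean-latency gains such schedulers aim for.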

The same forkable-execution idea recurs at the agent level: copy-on-write layering forks a 5.8-gigabyte agent world in about a seventh of a second, enabling counterfactual replay debugging and credit assignment E096. The vocabulary of operating systems is becoming the vocabulary of agent infrastructure.
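
Copy-on-write layering can be illustrated in a few lines with Python's collections.ChainMap, which looks keys up through a stack of dictionaries and writes only to the top one. This is an in-memory analogy, not the E096 system: the file paths and the fork helper are hypothetical, and real layered filesystems also need deletion markers, which this sketch leaves out.

```python
from collections import ChainMap

# The shared base layer; forks must treat it as read-only.
base_world = {"repo/main.py": "print('v1')", "repo/README": "docs"}
root = ChainMap({}, base_world)  # empty writable layer on top of the base


def fork(world):
    """Return a child that shares every existing layer and writes to a
    fresh, empty top layer. Cost is independent of the world's size."""
    return world.new_child()


branch_a = fork(root)
branch_b = fork(root)

# The write lands only in branch_a's private top layer.
branch_a["repo/main.py"] = "print('patched')"

seen_by_a = branch_a["repo/main.py"]  # the patched version
seen_by_b = branch_b["repo/main.py"]  # still the original
```

Forking copies nothing up front, so the cost of a fork does not grow with the size of the world, and each branch pays only for what it changes. That is what makes counterfactual replay cheap enough to use routinely.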

Episodes anchoring this topic

033-echo-kv-cache-free-associative-recall-with-spectral-koopman-
Reframed retrieval as regression, making the KV cache an implementation choice.
036-sparse-attention-as-a-range-searching-problem-towards-an-inf
Recast sparse attention as geometric range search, beating FlashAttention with near-perfect recall.
032-state-stream-transformer-sst-v2-parallel-training-of-nonline
Added per-layer latent memory and surfaced structure in latent reasoning.
074-hrm-text-efficient-pretraining-beyond-scaling
Showed a $1,500 1B model matching frontier reasoning via architecture and objective.
016-mars-efficient-adaptive-co-scheduling-for-heterogeneous-agen
Applied classical OS scheduling to fix agent-workload serving latency.
068-deltabox-scaling-stateful-ai-agents-with-millisecond-level-s
Used OS checkpointing tricks to unblock tree search and RL for stateful agents.