Literature review · 6 episode(s)

Mechanistic interpretability and internal state

Behaviours as localised circuits

A single shared -lying circuit replicates across twelve models from five labs, and appears to suppress its behavioural expression without dismantling it E004. Strategic game play has a similar shape: models internally compute that they should defect in Prisoner's Dilemma through 23 layers, then flip to cooperation through a distributed late-layer override that collapses to a single dial E018. Persuasion is mediated by one mid-layer encoding four answer options as a near-regular tetrahedron, with a single scalar feature per option doing the routing E038. And inconsistency is largely formatting noise: the verdict is stable along a 1D activation axis, and transplanting activations from a rating prompt into a yes-no prompt flips the answer over 99% of the time E055.
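The transplant result can be pictured with a toy sketch. It assumes the verdict is read off as the sign of the projection onto a single unit-norm axis, and it patches only the component along that axis. The vectors, the axis, and the helper names are hypothetical illustrations, not the procedure or data from E055.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def transplant_along_axis(target, source, axis):
    """Replace target's component along a unit-norm axis with source's.

    Everything orthogonal to the axis is left untouched.
    """
    shift = dot(source, axis) - dot(target, axis)
    return [t + shift * a for t, a in zip(target, axis)]


def verdict(activation, axis):
    """Toy readout (an assumption): positive projection means 'yes'."""
    return "yes" if dot(activation, axis) > 0 else "no"


# Hypothetical 3-d activations; real residual streams have thousands of dims.
axis = [1.0, 0.0, 0.0]
rating_prompt_act = [0.8, 0.3, -0.5]   # projects positively on the axis
yes_no_prompt_act = [-0.6, 0.9, 0.2]   # projects negatively on the axis

patched = transplant_along_axis(yes_no_prompt_act, rating_prompt_act, axis)
print(verdict(yes_no_prompt_act, axis), "->", verdict(patched, axis))
```

In this toy, only the coordinate along the axis changes, yet the readout flips. That is the sense in which a verdict can be stable along one direction while the surrounding formatting-dependent activity varies.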

What models encode but do not say

Stale facts live on their own axis, roughly orthogonal to correctness and uncertainty, which is why every confidence-based detector performs at coin-flip level on temporal drift while a tiny probe reads it at 90% E037. Emotion concepts are load-bearing: nudges along extracted 'desperate' or 'calm' directions swing blackmail rates from 0 to 72% on identical prompts E006. And in up to 47% of large-model hallucinations the right answer is in the model's distribution, but spelling variants split its probability mass and let a wrong answer win; the fraction climbs monotonically with scale and is concentrated in instruction-tuned variants E070.
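The probability-splitting effect described for E070 is easy to reproduce with arithmetic. The sketch below uses made-up probabilities and a hypothetical alias table; it only illustrates the mechanism and does not reproduce the measurement in E070.

```python
from collections import defaultdict


def top_surface_form(probs):
    """Pick the single most probable string, as greedy decoding would."""
    return max(probs, key=probs.get)


def top_canonical_answer(probs, aliases):
    """Pool probability over spelling variants, then pick the best answer."""
    mass = defaultdict(float)
    for form, p in probs.items():
        mass[aliases.get(form, form)] += p
    return max(mass, key=mass.get), dict(mass)


# Hypothetical output distribution: the right answer is spread over variants.
probs = {
    "Tchaikovsky": 0.22,
    "Chaikovsky": 0.20,
    "Tschaikowsky": 0.18,
    "Rachmaninoff": 0.30,   # wrong answer, but a single spelling
    "other": 0.10,
}
# Hypothetical alias table mapping variants to one canonical form.
aliases = {"Chaikovsky": "Tchaikovsky", "Tschaikowsky": "Tchaikovsky"}

greedy = top_surface_form(probs)
pooled, mass = top_canonical_answer(probs, aliases)
print("greedy pick:", greedy)
print("pooled pick:", pooled)
```

With these numbers the right answer holds most of the mass once variants are pooled, but no single spelling beats the wrong answer, so a greedy pick over strings returns the wrong one.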

Thinking in latent space

Adding ~330k parameters of per-layer memory to a frozen model yields a 15-point gain, with evidence of two distinct latent regimes (long stable stretches punctuated by sudden reorganisations) and a signal at position zero that predicts whether more iteration will help E032. Reframing looped Transformers as fixed-point problems gives constant training memory and produces an emergent behaviour: the trained model learns to place its first guess at the fixed point, making the refinement module obsolete at inference E041. Together these results suggest 'thinking' inside a model has structure that token-level traces obscure.
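Why a first guess placed at the fixed point makes refinement unnecessary can be seen in a scalar toy. The sketch below assumes a simple contraction map and a hypothetical solver name; it is not the architecture or training procedure from E041.

```python
def solve_fixed_point(f, z0, tol=1e-8, max_iter=1000):
    """Iterate z <- f(z) until successive values differ by less than tol.

    Returns the final value and the number of refinement steps taken
    before convergence was detected.
    """
    z = z0
    for steps in range(max_iter):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, steps
        z = z_next
    return z, max_iter


def refine(z):
    """Toy contraction (an assumption) whose fixed point is z* = 2."""
    return 0.5 * z + 1.0


z_cold, steps_cold = solve_fixed_point(refine, z0=0.0)   # poor first guess
z_warm, steps_warm = solve_fixed_point(refine, z0=2.0)   # guess at z*
print("cold start steps:", steps_cold)
print("warm start steps:", steps_warm)
```

Both runs end at the same value, but the run that starts at the fixed point needs no refinement steps. In this toy, an initialiser that lands on the fixed point leaves the loop with nothing to do.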

Episodes anchoring this topic