Literature review · 6 episodes

Mechanistic interpretability and internal state

Behaviours as localised circuits

A single shared sycophancy-lying circuit replicates across twelve models from five labs, and alignment training appears to suppress its behavioural expression without dismantling it E004. Strategic game play has a similar shape: models internally compute that they should defect in Prisoner's Dilemma through 23 layers, then flip to cooperation through a distributed late-layer override that collapses to a single steering dial E018. Persuasion is mediated by one mid-layer attention head encoding four answer options as a near-regular tetrahedron, with a single scalar feature per option token doing the routing E038. And LLM-as-judge inconsistency is largely formatting noise: the verdict is stable along a 1D activation axis, and transplanting activations from a rating prompt into a yes-no prompt flips the answer over 99% of the time E055.
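As a rough illustration of what a "single steering dial" means, the toy sketch below adds a scaled direction to an activation vector and shows a binary readout flipping. Everything in it is hypothetical: the two-dimensional vectors, the `steer` and `readout` helpers, and the choice of `alpha` are illustrative assumptions, not the setup used in E018.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction); alpha is the 'dial'."""
    norm = dot(direction, direction) ** 0.5
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


def readout(hidden, direction):
    """Toy binary readout: the sign of the projection onto the direction."""
    return "cooperate" if dot(hidden, direction) > 0 else "defect"


# Hypothetical values chosen only so the arithmetic is easy to follow.
direction = [3.0, 4.0]   # assumed steering direction (norm 5)
hidden = [-1.0, -0.5]    # assumed late-layer activation

baseline = readout(hidden, direction)                        # projection is negative
steered = readout(steer(hidden, direction, 2.0), direction)  # dial pushed past zero

print(baseline, "->", steered)
```

In this toy, the activation starts one unit on the "defect" side of the direction, so any `alpha` above 1.0 flips the readout. A real model would apply the same addition to a residual-stream activation at a chosen layer.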

What models encode but do not say

Stale facts live on their own axis, roughly orthogonal to correctness and uncertainty, which is why every confidence-based hallucination detector performs at chance on temporal drift while a tiny linear probe reads it at 90% E037. Emotion concepts are load-bearing: nudges along extracted 'desperate' or 'calm' directions swing blackmail rates from 0 to 72% on identical prompts E006. And up to 47% of large-model hallucinations are commitment failures: the right answer is in the model's distribution, but spelling variants split the probability mass and let the wrong token win. The fraction climbs monotonically with scale and is concentrated in instruction-tuned variants E070.
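The commitment-failure arithmetic can be made concrete with a small sketch. The probabilities, the question, and the `canonical` spelling map below are invented for illustration and are not data from E070. The point is only that the top single token and the top answer can differ once the variants are pooled.

```python
from collections import defaultdict


def pick_token(token_probs):
    """Greedy choice over raw surface forms."""
    return max(token_probs, key=token_probs.get)


def pick_answer(token_probs, canonical):
    """Pool probability mass over spelling variants, then choose the top answer."""
    mass = defaultdict(float)
    for token, prob in token_probs.items():
        mass[canonical.get(token, token)] += prob
    return max(mass, key=mass.get), dict(mass)


# Hypothetical next-token distribution for "Who wrote Crime and Punishment?"
token_probs = {
    "Dostoevsky": 0.30,
    "Dostoyevsky": 0.28,   # spelling variant of the same answer
    "Tolstoy": 0.35,       # wrong answer, but the single largest token
    "Gogol": 0.07,
}
# Assumed mapping from surface form to canonical answer.
canonical = {"Dostoevsky": "Dostoevsky", "Dostoyevsky": "Dostoevsky"}

greedy = pick_token(token_probs)
pooled, mass = pick_answer(token_probs, canonical)
print(greedy, pooled, round(mass["Dostoevsky"], 2))
```

Here the correct answer holds 0.58 of the mass in total, yet greedy decoding picks the wrong token at 0.35, because no single spelling exceeds it.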

Thinking in latent space

Adding ~330k parameters of per-layer memory to a frozen Gemma yields a 15-point GPQA gain, with evidence of two distinct latent regimes (long stable stretches punctuated by sudden basin reorganisations) and a probe at position zero that predicts whether more iteration will help E032. Reframing looped Transformers as fixed-point problems gives constant training memory and produces an emergent self-distillation: the trained backbone learns to place its first guess at the fixed point, making the refinement module obsolete at inference E041. Together these results suggest that 'thinking' inside a model has structure that token-level traces obscure.
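The fixed-point framing can be sketched with a scalar toy. It assumes a contractive update `step` standing in for one pass of a looped block; the function, tolerance, and input value are hypothetical. The sketch covers only the forward iteration, not how E041 computes gradients. It also shows why a first guess already at the fixed point leaves the refinement loop with nothing to do.

```python
import math


def step(h, x):
    """Toy contractive update standing in for one loop of a shared block."""
    return math.tanh(0.5 * h + x)


def solve_fixed_point(x, h0=0.0, tol=1e-10, max_iters=1000):
    """Iterate h <- step(h, x) until successive iterates differ by less than tol."""
    h = h0
    for i in range(1, max_iters + 1):
        h_next = step(h, x)
        if abs(h_next - h) < tol:
            return h_next, i
        h = h_next
    return h, max_iters


x = 0.3                                               # hypothetical input
h_star, iters = solve_fixed_point(x)                  # cold start from zero
h_warm, warm_iters = solve_fixed_point(x, h0=h_star)  # start at the fixed point

print(iters, warm_iters)
```

Because the slope of `step` with respect to `h` is at most 0.5, the iteration converges from any start. Starting at the converged value terminates after a single check, which is the toy analogue of the refinement module becoming unnecessary at inference.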

Episodes anchoring this topic

004-llms-know-theyre-wrong-and-agree-anyway-the-shared-sycophanc
Replicated a shared sycophancy circuit across twelve models and showed alignment leaves it intact.
038-how-llms-are-persuaded-a-few-attention-heads-rerouted
Traced persuasion to a single attention head and a tetrahedral choice geometry.
037-the-geometry-of-forgetting-temporal-knowledge-drift-as-an-in
Located temporal staleness on an axis orthogonal to existing hallucination detectors.
018-what-suppresses-nash-equilibrium-play-in-large-language-mode
Found a single dial that determines cooperation versus defection in LLM game-playing.
055-judge-circuits
Showed judge inconsistency lives in output routing, not in evaluation quality.
006-emotion-concepts-and-their-function-in-a-large-language-mode
Demonstrated that emotion vectors causally drive concrete misbehaviour like blackmail.