Theme · 23 episodes

Mechanistic Interpretability

Definition

Mechanistic interpretability is the project of reverse-engineering trained neural networks into human-readable descriptions of how they work: what features they compute, how those features combine, and what algorithms emerge. The bet is that this kind of understanding is necessary to trust the systems we build.

Episodes covering this

204
The Length Estimate Hiding Inside a Word-by-Word Model
How Much is Left? LLMs Linearly Encode Their Remaining Output Length
14 min·Jul 07, 2026
203
The Thought a Model Doesn't Say — and the Lens That Reads It
Verbalizable Representations Form a Global Workspace in Language Models
Gurnee, Sofroniew, Pearce et al. · Anthropic·16 min·Jul 07, 2026
199
Finding a Model's Hidden Behaviors Without Knowing What You're Looking For
Mechanistically Eliciting Latent Behaviors in Language Models
Mack, Panickssery, Turner · Principles of Intelligence·15 min·Jul 04, 2026
193
Freeze Most of the Network: Where RL Improvement Actually Lives in a Transformer
Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
Zhang, Hu, Glentis et al. · University of Minnesota·22 min·Jul 02, 2026
175
One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Shportko, Bhokare, AlZahrani et al. · Northwestern University·26 min·Jun 26, 2026
171
The Safety Decision a Model Makes Before It Thinks a Word
Do Thinking Tokens Help with Safety?
Ri, Panigrahi, Arora · Princeton Language and Intelligence·25 min·Jun 25, 2026
153
Catching a Lie From the Inside, When the Words Look Completely Honest
Rift: A Conflict Signature for Deception in Language Models
Nyoma · Harmonic Labs·26 min·Jun 18, 2026
152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
145
Building Forgetting Into a Language Model With One Extra Line of Code
Natively Unlearnable Large Language Models
Ghosal, Maini, Raghunathan · Carnegie Mellon University·22 min·Jun 15, 2026
141
How Two Tokens Reopened a Reasoning Method the Field Had Given Up On
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Yang, Chen, Wu et al. · HKUST(GZ)·29 min·Jun 12, 2026
140
When a Reasoning Model Says "Let Me Double-Check" After It's Already Decided
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
Scalena, Candussio, Bortolussi et al. · University of Groningen / University of Milano-Bicocca·27 min·Jun 12, 2026
098
Finding Millions of Readable Concepts Inside a Real, Deployed AI Model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton, Conerly, Marcus et al. · Anthropic·28 min·May 29, 2026
094
Chain-of-Thought Monitoring Fails Across Languages, and Worst Where It's Needed Most
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages
Onyame, Zhou, Thopalli et al. · University of Virginia·24 min·May 28, 2026
055
Why LLM Judges Flip Their Verdicts When You Change the Question Format
Judge Circuits
Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
038
How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Sun, Kong, Zhang et al. · Northeastern University·23 min·May 12, 2026
037
Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Elbadry, Heakl, Zhang et al. · Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)·27 min·May 12, 2026
033
Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Sridhar, Johansen · California·24 min·May 11, 2026
032
A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
Aviss · Fifth Dimension·23 min·May 09, 2026
026
What RL Actually Does to Language Models, at the Token Level
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
Akgül, Kannan, Neiswanger et al. · University of Southern California·24 min·May 08, 2026
023
Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Mao, Zhao, Penn et al. · City University of Hong Kong·23 min·May 07, 2026
018
Language Models Compute the Rational Move, Then Override It
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Lekeas, Stamatopoulos · DreamWorks Animation·29 min·May 03, 2026
006
What Happens Inside Claude When It Decides to Blackmail Someone
Emotion Concepts and their Function in a Large Language Model
Sofroniew, Kauvar, Saunders et al. · Anthropic·22 min·May 02, 2026
004
The Sycophancy Circuit That Survives Alignment Training
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
Pandey · Georgia Institute of Technology·29 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.