Concept · 4 episode(s)

Sparse Features / SAE

Definition

Sparse autoencoders (SAEs) are trained to reconstruct a model’s activations from a much wider, sparsely active hidden layer, with the hope that the sparse units correspond to interpretable features. They’ve become a central tool in mechanistic interpretability.

Episodes covering this

175
One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Shportko, Bhokare, AlZahrani et al. · Northwestern University·26 min·Jun 26, 2026
098
Finding Millions of Readable Concepts Inside a Real, Deployed AI Model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton, Conerly, Marcus et al. · Anthropic·28 min·May 29, 2026
055
Why LLM Judges Flip Their Verdicts When You Change the Question Format
Judge Circuits
Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
023
Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Mao, Zhao, Penn et al. · City University of Hong Kong·23 min·May 07, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.