Theme · 10 episodes

Mechanistic Interpretability

Definition

Mechanistic interpretability is the project of reverse-engineering trained neural networks into human-readable descriptions of how they work: what features they compute, how those features combine, and what algorithms emerge from them. The bet is that this kind of understanding is necessary to trust the systems we build.

Episodes covering this

  1. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format
    Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
  2. 038
    How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
    Sun, Kong, Zhang et al. · Northeastern University·23 min·May 12, 2026
  3. 037
    Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
    Elbadry, Heakl, Zhang et al. · Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)·27 min·May 12, 2026
  4. 033
    Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
    Sridhar, Johansen · California·24 min·May 11, 2026
  5. 032
    A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking
    Aviss · Fifth Dimension·23 min·May 09, 2026
  6. 026
    What RL Actually Does to Language Models, at the Token Level
    Akgül, Kannan, Neiswanger et al. · University of Southern California·24 min·May 08, 2026
  7. 023
    Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
    Mao, Zhao, Penn et al. · City University of Hong Kong·23 min·May 07, 2026
  8. 018
    Language Models Compute the Rational Move, Then Override It
    Lekeas, Stamatopoulos · DreamWorks Animation·29 min·May 03, 2026
  9. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
    Sofroniew, Kauvar, Saunders et al. · Anthropic·22 min·May 02, 2026
  10. 004
    The Sycophancy Circuit That Survives Alignment Training
    Pandey · Georgia Institute of Technology·29 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.