Glossary · Term

mechanistic interpretability

Definition

Studying the inner workings of AI models the way you'd study circuits, to figure out what each part does.

A research area focused on reverse-engineering specific computations and circuits inside neural networks rather than only describing input-output behavior.

Also called: mechanistic

Mentioned in 16 episodes

  1. 077
    Reading a Model's Confidence Curve to Decide When Chain-of-Thought Is Worth It
  2. 073
    When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
  3. 069
    When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
  4. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format
  5. 049
    An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
  6. 037
    Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
  7. 029
    Why Forty-Eight Percent on FrontierMath Isn't the Real Story in DeepMind's New Math Paper
  8. 026
    What RL Actually Does to Language Models, at the Token Level
  9. 023
    Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
  10. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
  11. 018
    Language Models Compute the Rational Move, Then Override It
  12. 013
    Why Search Keeps Rediscovering the Same Workflow, and What That Means
  13. 007
    Exploration Hacking: When Models Sabotage Their Own RL Training
  14. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
  15. 004
    The Sycophancy Circuit That Survives Alignment Training
  16. 001
    When AI Models Quietly Protect Each Other From Shutdown

Related concepts