mechanistic interpretability · Glossary

Definition

Plain language

Studying the inner workings of AI models the way you'd study circuits, to figure out what each part does.

As stated in the literature

A research area focused on reverse-engineering specific computations and circuits inside neural networks rather than only describing input-output behavior.

Also called: mechanistic

Why it matters: Knowing how a model does what it does, not just what it does, is arguably the most direct path to predicting and fixing its failures.

For example, researchers might identify a specific attention head that always copies the subject from earlier in a sentence and trace exactly how it implements that behavior.
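To make the attention-head example concrete, here is a minimal sketch in plain Python. It is a toy, not a real transformer: the vocabulary, the hand-set attention scores, the 0.5 baseline weight, and the function names (copy_head, predict) are all assumptions invented for illustration. It builds a single "copy" head that attends from the last position back to the subject, then zero-ablates that head to check whether the behavior actually depends on it, which is one simple form of causal intervention.

```python
import math

# Hypothetical toy vocabulary; each token is represented as a one-hot vector.
VOCAB = ["Mary", "went", "home", "and"]


def one_hot(token):
    return [1.0 if t == token else 0.0 for t in VOCAB]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_head(tokens, subject_score=10.0):
    """Hand-built head (an assumption, not a trained one): the last position
    attends almost entirely to position 0, the subject, and copies it."""
    scores = [subject_score if i == 0 else 0.0 for i in range(len(tokens))]
    weights = softmax(scores)
    values = [one_hot(t) for t in tokens]
    # Head output = attention-weighted sum of the value vectors.
    output = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(VOCAB))
    ]
    return weights, output


def predict(tokens, ablate_head=False):
    """Return the toy model's top token at the last position."""
    _, head_out = copy_head(tokens)
    if ablate_head:
        # Zero-ablation: remove the head's contribution entirely.
        head_out = [0.0] * len(VOCAB)
    # Toy logits: a weak baseline from the current token plus the head output.
    baseline = [0.5 * x for x in one_hot(tokens[-1])]
    logits = [b + h for b, h in zip(baseline, head_out)]
    best = max(range(len(VOCAB)), key=lambda i: logits[i])
    return VOCAB[best]


sentence = ["Mary", "went", "home", "and"]
attention, _ = copy_head(sentence)
clean = predict(sentence)
ablated = predict(sentence, ablate_head=True)
print("attention from last position:", [round(w, 4) for w in attention])
print("with head:", clean, "| head ablated:", ablated)
```

In this toy, the prediction is the subject while the head is active and changes once the head is ablated, which is the kind of evidence used to argue that a component implements a behavior. In a real model the head would be located by inspecting attention patterns and the ablation applied to learned activations rather than hand-set ones.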

Heard on the show

“But there's no mechanistic theory for why the center of the stack is where RL adaptation lands.”

Episode 193 — Freeze Most of the Network: Where RL Improvement Actually Lives in a Transformer

Mentioned in 24 episodes

Related concepts

Causal Intervention · Circuit Analysis · Linear Representation · Path Patching · Sparse Features / SAE

Related terms

circuit · neural network