Glossary · Term

activation patching

Definition

Plain language

Swapping a piece of a model's internal state from one run into another to test what that piece does.

As stated in the literature

An interpretability technique that substitutes internal activations from one forward pass into another to test whether a specific component is causally responsible for a behavior.

Why it matters: It's one of the few techniques that lets researchers make causal claims about what specific parts of a neural network actually do.

For example, you can grab the hidden state from the middle of a model running on 'Paris' and paste it into a run on 'Tokyo' to see if the output flips to 'France'.

Heard on the show

“The technique is called activation patching.”

Episode 038 — How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial

Mentioned in 2 episodes

038
How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
037
Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say

Related concepts

Path Patching

Related terms

forward pass