Glossary · Term

activation patching

← all terms

Definition

Swapping a piece of a model's internal state from one run into another to test what that piece does.

An interpretability technique that substitutes internal activations from one forward pass into another to test whether a specific component is causally responsible for a behavior.

Mentioned in 2 episodes

  1. 038
    How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
  2. 037
    Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say

Related concepts