Glossary · Term

path patching

Definition

Plain language

Tracing which specific connections inside a model carry a particular behavior by intervening on them.

As stated in the literature

An interpretability method that intervenes on specific edges between components in a network to identify which causal pathways carry a behavior.

Why it matters: Pinpointing which internal pathways carry a behavior is what turns interpretability from suggestive into mechanistic.

For example, you swap the output of one attention head's edge to a downstream MLP and watch whether the model's answer changes.