Concept · 1 episode

Path Patching

Definition

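Path patching is a mechanistic interpretability technique that swaps the contribution of one specific path through a network (say, a head’s output flowing into a later head’s query) between two inputs, typically a clean prompt and a corrupted one, while leaving every other path untouched. The resulting change in the model’s output measures how much that single path is responsible for a behavior. It’s the scalpel version of activation patching, which replaces an activation for everything downstream of it.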
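To make the difference concrete, here is a minimal sketch on a toy two-component network in NumPy. Everything in it is an illustrative assumption rather than a real transformer or any library's API: `head_a`, `head_b`, the random weights, the tanh gate standing in for a query, and the choice to patch from the corrupted run into the clean run. Head A writes into a residual stream; head B reads that stream once as its "query" input and once as its "value" input. Path patching changes only what B's query sees from A, while activation patching changes A's output everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical toy weights; a real model would supply these.
W_A = rng.normal(size=(d, d))   # "head A" writes W_A @ x into the residual stream
W_q = rng.normal(size=(d, d))   # "head B" query-side read
W_v = rng.normal(size=(d, d))   # "head B" value-side read
w_out = rng.normal(size=d)      # readout to a single scalar "logit"


def head_a(x):
    return W_A @ x


def head_b(q_in, v_in):
    # Toy stand-in for attention: the query input gates the value input.
    return np.tanh(W_q @ q_in) * (W_v @ v_in)


def forward(x, a_for_query=None, a_everywhere=None):
    """Run the toy network on x, optionally overriding head A's output.

    a_everywhere: replaces A's output on every downstream path
                  (activation patching).
    a_for_query:  replaces A's output only on the path into B's query
                  (path patching); all other paths keep the original value.
    """
    a = head_a(x) if a_everywhere is None else a_everywhere
    a_q = a if a_for_query is None else a_for_query
    q_in = x + a_q          # what B's query reads from the residual stream
    v_in = x + a            # what B's value reads from the residual stream
    b = head_b(q_in, v_in)
    return float(w_out @ (x + a + b))


clean = rng.normal(size=d)
corrupt = rng.normal(size=d)
a_corrupt = head_a(corrupt)   # A's output cached from the corrupted run

base = forward(clean)
path_patched = forward(clean, a_for_query=a_corrupt)
act_patched = forward(clean, a_everywhere=a_corrupt)

path_effect = path_patched - base   # effect of the single path A -> B.query
act_effect = act_patched - base     # effect of A through all paths combined

print("path A -> B.query effect:", path_effect)
print("activation patch effect: ", act_effect)
```

The gap between the two numbers is the part of A's influence that travels by routes other than B's query: directly to the readout, and through B's value input. In a real transformer the same idea is usually implemented by caching activations from one run and substituting them at the input of a specific downstream component during another run.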

Episodes covering this