Concept · 6 episodes

Causal Intervention


Definition

Causal interventions in a neural network swap, ablate, or patch internal activations to test whether a particular component causes a behavior rather than merely correlating with it. They’re the gold standard in mechanistic interpretability for the same reason randomized trials are in medicine: observation alone can’t distinguish cause from confound.
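To make the definition concrete, here is a minimal sketch of ablation and activation patching on a toy two-layer network. Everything in it is an assumption made for illustration: the weights, the inputs, and the helper names (`hidden`, `output`, `run`) are hypothetical, and real interpretability work would apply the same idea to a trained model through its framework's hooks.

```python
# Toy network in plain Python. All weights and inputs are made up for illustration.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
W2 = [2.0, 0.0, 0.5]                        # output weights; unit 1 is ignored downstream


def hidden(x):
    """ReLU hidden activations for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))


def run(x, patch=None):
    """Forward pass; `patch` maps a hidden-unit index to a value forced onto it."""
    h = hidden(x)
    for i, value in (patch or {}).items():
        h[i] = value  # the intervention: overwrite the activation mid-computation
    return output(h)


clean, corrupted = [1.0, 2.0], [0.0, 2.0]
clean_h = hidden(clean)

# Ablation: zero one unit on the clean input and measure how much the output drops.
ablation_effect = [run(clean) - run(clean, {i: 0.0}) for i in range(3)]

# Patching: run the corrupted input, restore one unit's clean activation,
# and measure how much of the clean behavior comes back.
patching_effect = [run(corrupted, {i: clean_h[i]}) - run(corrupted) for i in range(3)]

print(clean_h)          # activations on the clean input
print(ablation_effect)  # per-unit effect of zero-ablation
print(patching_effect)  # per-unit effect of patching in the clean activation
```

In this toy setup, hidden unit 1 is strongly active on the clean input, yet both interventions report zero effect for it because nothing downstream reads it. That is the gap between correlation and causation the definition describes: observing activations alone would flag the unit as relevant, while intervening shows it is not.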

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.