Definition
Causal interventions in a neural network swap, ablate, or patch internal activations to test whether a particular component causes a behavior rather than merely correlating with it. They’re the gold standard in mechanistic interpretability for the same reason randomized controlled trials are in medicine: observation alone can’t distinguish cause from confound.
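To make the definition concrete, here is a minimal sketch of ablation and patching on a toy two-layer network written in plain Python. The weights, the inputs, and the helper names (`forward`, `ablate`, `patch`) are all made up for illustration; real interpretability work would apply the same idea to a trained model through its framework's hook mechanism.

```python
# Toy network: 2 inputs -> 3 ReLU hidden units -> 1 output.
# All weights are hypothetical, chosen so the arithmetic is easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per hidden unit
W2 = [2.0, 0.0, -1.0]                      # hidden unit 1 has zero output weight


def relu(v):
    return max(0.0, v)


def hidden(x):
    """Hidden activations for input x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def forward(x, intervene=None):
    """Run the network; `intervene` maps a hidden-unit index to a forced value."""
    h = hidden(x)
    for unit, value in (intervene or {}).items():
        h[unit] = value
    return sum(w * hi for w, hi in zip(W2, h))


def ablate(x, unit):
    """Zero-ablation: force one hidden unit to 0 and rerun."""
    return forward(x, {unit: 0.0})


def patch(x_clean, x_corrupt, unit):
    """Activation patching: run the corrupt input, but copy one
    hidden unit's activation from the clean run."""
    return forward(x_corrupt, {unit: hidden(x_clean)[unit]})


clean = [1.0, 2.0]
corrupt = [3.0, 2.0]

baseline = forward(clean)
# Causal effect of each unit = how much the output moves when it is ablated.
effects = {u: baseline - ablate(clean, u) for u in range(3)}

print("hidden activations:", hidden(clean))
print("baseline output:", baseline)
print("ablation effects:", effects)
print("corrupt output:", forward(corrupt))
print("corrupt output with unit 0 patched from clean:", patch(clean, corrupt, 0))
```

Hidden unit 1 is the instructive case: it is active on the clean input and tracks the second input feature, so an observational probe would flag it, yet ablating it leaves the output unchanged because nothing downstream reads from it. Only the intervention reveals that.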
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.