Definition
Mechanistic interpretability is the project of reverse-engineering trained neural networks into human-readable descriptions of how they work: what features they compute, how those features combine, what algorithms emerge. The bet is that this kind of understanding is necessary to trust the systems we build.
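To make "what features they compute, how those features combine" concrete, here is a minimal sketch on a toy network. Everything in it is an assumption made for illustration: the weights are set by hand rather than learned, and the names (`hidden_features`, `forward`, `activation_table`) are hypothetical. Real interpretability work targets trained models that are far larger and messier.

```python
# Toy illustration only: a two-unit ReLU network with hand-set (not trained)
# weights that computes XOR. All names and weights here are hypothetical.

def relu(z):
    return max(0.0, z)

# Hidden layer: each unit sees both inputs.
W_HIDDEN = [(1.0, 1.0), (1.0, 1.0)]
B_HIDDEN = [0.0, -1.0]
# Output layer: a linear readout of the two hidden units.
W_OUT = [1.0, -2.0]

def hidden_features(x1, x2):
    """Return the activations of the two hidden units."""
    return [relu(w1 * x1 + w2 * x2 + b)
            for (w1, w2), b in zip(W_HIDDEN, B_HIDDEN)]

def forward(x1, x2):
    """Return the network output for one input pair."""
    h = hidden_features(x1, x2)
    return sum(w * hi for w, hi in zip(W_OUT, h))

def activation_table():
    """Record hidden activations and output for every binary input."""
    return [((x1, x2), hidden_features(x1, x2), forward(x1, x2))
            for x1 in (0, 1) for x2 in (0, 1)]

if __name__ == "__main__":
    for inputs, h, y in activation_table():
        print(inputs, h, y)
```

Reading the table gives a human-readable account of this toy: the first hidden unit counts how many inputs are on, the second fires only when both are on, and the output subtracts twice the second from the first, which is XOR. That is the kind of description (features, then how they combine into an algorithm) that the definition above refers to, here on a network small enough to check by hand.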
Episodes covering this
Worth reading next
Papers we haven't covered in a deep dive yet but would recommend on this topic.
- Representation Engineering: A Top-Down Approach to AI Transparency
- Scaling and evaluating sparse autoencoders
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Towards the Fundamental Limits of Knowledge over Parametric Models