Definition
The linear representation hypothesis is the conjecture that meaningful concepts in neural networks live along approximately linear directions in activation space: that “truth,” “refusal,” or “Spanish” each corresponds to a vector you can find and steer along. Activation steering and most of mechanistic interpretability lean heavily on this holding at least approximately.
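To make "find a vector and steer along it" concrete, here is a minimal sketch using a difference of means, one common way to estimate a concept direction. Everything in it is an assumption for illustration: the activations are synthetic Gaussian data with a planted direction rather than real model activations, and the names `concept_direction`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np


def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean negative example to the mean positive one."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(acts, direction, alpha):
    """Shift every activation by alpha along the (unit) concept direction."""
    return acts + alpha * direction


# Synthetic stand-in for activations (assumption: no real model is involved).
rng = np.random.default_rng(0)
dim = 64
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)

# "Positive" and "negative" examples differ only by an offset along true_dir.
pos = rng.normal(size=(500, dim)) + 3.0 * true_dir
neg = rng.normal(size=(500, dim)) - 3.0 * true_dir

direction = concept_direction(pos, neg)

# How well the estimated direction matches the planted one.
cosine = float(direction @ true_dir)

# Mean projection of negative examples onto the direction, before and after steering.
proj_before = float((neg @ direction).mean())
proj_after = float((steer(neg, direction, alpha=6.0) @ direction).mean())
```

If the hypothesis holds for a concept, the same recipe applied to real activations (collected from contrasting prompts) should recover a direction along which adding a multiple of the vector shifts the model's behavior. When the concept is not linearly encoded, the difference of means still returns a vector, but steering along it has no reliable effect.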
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- The Internal State of an LLM Knows When It's Lying
- Representation Engineering: A Top-Down Approach to AI Transparency
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
- The Platonic Representation Hypothesis
- Towards the Fundamental Limits of Knowledge over Parametric Models