Concept · 5 episode(s)

Linear Representation

← all concepts

Definition

The linear representation hypothesis is the conjecture that meaningful concepts in neural networks live along approximately linear directions in activation space — that “truth,” “refusal,” or “Spanish” is a vector you can find and steer along. Activation steering and most of mechanistic interpretability lean heavily on this being mostly true.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.