Concept · 2 episode(s)

Sparse Features / SAE

← all concepts

Definition

Sparse autoencoders (SAEs) are trained to reconstruct a model’s activations from a much wider, sparsely active hidden layer, with the hope that the sparse units correspond to interpretable features. They’ve become a central tool in mechanistic interpretability.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.