Definition
Sparse autoencoders (SAEs) are trained to reconstruct a model’s activations through a hidden layer that is much wider than the activation vector but in which only a few units are active for any given input. The hope is that each of these sparse units corresponds to an interpretable feature. They’ve become a central tool in mechanistic interpretability.
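As a rough illustration of the definition above, here is a minimal NumPy sketch of one forward pass and its loss. The ReLU encoder, the L1 sparsity penalty, the `l1_coeff` value, and the layer sizes are all assumptions chosen for illustration. They reflect one common setup, not something this page specifies, and names like `sae_forward` and `sae_loss` are hypothetical.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into a wide hidden layer, then reconstruct them."""
    # ReLU (an assumed choice) keeps hidden units non-negative,
    # so units with negative pre-activations are exactly zero.
    h = np.maximum(0.0, x @ W_enc + b_enc)
    # Linear decoder maps the hidden code back to activation space.
    x_hat = h @ W_dec + b_dec
    return h, x_hat


def sae_loss(x, h, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(h), axis=-1))
    return recon + l1_coeff * sparsity


# Hypothetical sizes: the hidden layer is 8x wider than the activations.
rng = np.random.default_rng(0)
d_model, d_hidden, batch = 16, 128, 4

x = rng.normal(size=(batch, d_model))  # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

h, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, h, x_hat)
```

The sketch only shows the forward computation with random weights. In practice the weights would be fitted by gradient descent on this loss, using activations collected from the model being studied.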
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.