Definition
A small auxiliary network trained to decompose a model's dense internal activations into interpretable features, only a few of which are active on any given input.
An interpretability tool that learns an overcomplete, sparse dictionary over a model's activations, reconstructing those activations approximately while exposing features that tend to be more monosemantic than individual neurons.
Also called: SAE, sparse autoencoders
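
Example
The definition above can be made concrete with a minimal sketch of one forward pass. It assumes a single-layer ReLU encoder, a linear decoder, and a loss that sums squared reconstruction error with an L1 sparsity penalty. That is one common formulation, not the only one. The names sae_forward, W_enc, W_dec, b_enc, b_dec, and l1_coeff, and all sizes, are hypothetical and chosen only for illustration.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """One forward pass of a minimal sparse autoencoder (illustrative sketch)."""
    # Encode: project activations into a wider feature space.
    # ReLU keeps feature activations non-negative.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decode: rebuild each activation vector as a weighted sum of dictionary rows.
    x_hat = f @ W_dec + b_dec
    # Reconstruction term: mean squared error per example.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    # Sparsity term: mean L1 norm of the feature activations.
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return f, x_hat, recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64  # overcomplete: more dictionary features than activation dims
x = rng.normal(size=(8, d_model))  # stand-in for a batch of model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

f, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(f.shape, x_hat.shape)
```

The weights here are random, so the features are neither sparse nor interpretable. Only after training to minimize this loss on real activations would the L1 term push most entries of f to zero for any given input.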