sparse autoencoder

Definition

A small auxiliary network trained to break a model's dense internal states into a few interpretable pieces at a time.

An interpretability tool that learns an overcomplete, sparse dictionary over a model's activations. Each activation is reconstructed approximately from a small number of active dictionary features, which tend to be more monosemantic than individual neurons.
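
As a rough illustration of the definition, the sketch below shows one common way to write a sparse autoencoder's forward pass and training loss in plain Python. It is a minimal sketch under stated assumptions, not any particular project's implementation: the names `sae_forward`, `sae_loss`, `d_model`, `d_dict`, and `l1_coeff` are hypothetical, the weights are random rather than trained, and subtracting the decoder bias before encoding is one convention among several.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for one activation vector x.

    Assumed convention: centre x by the decoder bias, encode with a ReLU,
    then decode linearly and add the decoder bias back.
    """
    centered = [a - b for a, b in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy dimensions: the dictionary is overcomplete (d_dict > d_model).
d_model, d_dict = 4, 16
rng = random.Random(0)
W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
W_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
b_enc = [0.0] * d_dict
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for a model activation
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

Training would minimise `sae_loss` over many activations. The L1 term is what pushes most entries of `features` to zero, so that only a few dictionary features are active for any given input.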

Also called: SAE, sparse autoencoders

Mentioned in 1 episode

  1. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format