sparse autoencoder

Definition

Plain language

A small auxiliary network trained to break a model's dense internal states into interpretable pieces, only a few of which are active at a time.

As stated in the literature

An interpretability tool that learns an overcomplete sparse dictionary over model activations, reconstructing those activations approximately while exposing features that are more monosemantic than individual neurons.

Also called: SAE, sparse autoencoders

Why it matters: By pulling apart entangled activations into roughly one-concept-per-feature pieces, SAEs are one of the main tools for opening the black box of large language models.

For example, the autoencoder might learn one feature that lights up almost only when the model is processing text about Paris, while another fires for past-tense verbs.
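
To make the definition concrete, here is a minimal sketch of one common way a sparse autoencoder is set up: a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. Everything specific in it is an assumption for illustration, not taken from any particular paper or library: the hand-picked toy weights, the 2-dimensional "activations", the 4-feature dictionary, the `l1_coeff` value, and the function names. A real SAE learns its weights by gradient descent over activations collected from a model.

```python
# Toy sparse autoencoder forward pass in pure Python (illustrative sketch only).
# Assumed setup: 2-dimensional activations, 4 dictionary features (overcomplete).

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

# Hypothetical hand-set weights; a trained SAE would learn these.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # features x dims
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]      # dims x features
B_DEC = [0.0, 0.0]

def encode(x):
    """Map an activation vector to non-negative feature activations."""
    # Subtracting the decoder bias first is one convention, assumed here.
    centered = [a - b for a, b in zip(x, B_DEC)]
    pre = [p + b for p, b in zip(matvec(W_ENC, centered), B_ENC)]
    return relu(pre)

def decode(f):
    """Rebuild the activation as a weighted sum of dictionary directions."""
    return [r + b for r, b in zip(matvec(W_DEC, f), B_DEC)]

def sae_loss(x, l1_coeff=0.1):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    f = encode(x)
    x_hat = decode(f)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in f)
    return reconstruction + l1_coeff * sparsity

features = encode([3.0, 0.0])   # only the first of the four features is active
rebuilt = decode(features)      # recovers the original activation
```

In this toy case a single feature out of four is active for the input, and decoding that one feature recovers the input. That is the "few pieces at a time" behavior the definition describes, although trained SAEs only approximate the original activation.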

Heard on the show

“… every technical term tap-to-define, with links to the related papers grouped by theme, from sparse autoencoders to the original crosscoder work, plus our weekly and monthly roundups. …”

Episode 175 — One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent

Mentioned in 3 episodes
