Definition
A small extra network that translates dense, hard-to-read neural activations into a sparser, more interpretable form.
An interpretability tool that replaces an MLP layer with a sparsely coded version, which produces roughly equivalent outputs while exposing more interpretable features.
Also called: transcoders
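
Example
The definition above can be made concrete with a toy forward pass. This is a minimal sketch, not a reference implementation: it assumes one common formulation in which the replacement reads the MLP's input, encodes it into a wider set of nonnegative features with a ReLU, and decodes those features into a prediction of the MLP's output, trained with a squared-error term plus an L1 sparsity penalty. The names mlp, transcoder_forward, transcoder_loss, and l1_coeff, and all the sizes, are hypothetical choices made for illustration.

```python
import numpy as np


def mlp(x, W_in, W_out):
    """Toy dense MLP layer to be approximated (ReLU hidden layer, no biases)."""
    return np.maximum(x @ W_in, 0.0) @ W_out


def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode the MLP input into nonnegative features, then decode them
    into a prediction of the MLP output."""
    features = np.maximum(x @ W_enc + b_enc, 0.0)  # feature activations, all >= 0
    y_hat = features @ W_dec + b_dec               # approximate MLP output
    return features, y_hat


def transcoder_loss(x, y, params, l1_coeff=1e-3):
    """Assumed objective: squared error against the real MLP output
    plus an L1 penalty that pushes feature activations toward zero."""
    features, y_hat = transcoder_forward(x, *params)
    reconstruction = np.mean(np.sum((y_hat - y) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return reconstruction + l1_coeff * sparsity


# Hypothetical sizes: the feature dictionary is wider than the MLP hidden layer.
rng = np.random.default_rng(0)
d_model, d_hidden, n_features, batch = 8, 32, 64, 16

x = rng.normal(size=(batch, d_model))
W_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
y = mlp(x, W_in, W_out)  # target: what the original layer produces

# Untrained, randomly initialized parameters; training would minimize the loss.
params = (
    rng.normal(size=(d_model, n_features)) / np.sqrt(d_model),  # W_enc
    np.zeros(n_features),                                       # b_enc
    rng.normal(size=(n_features, d_model)) / np.sqrt(n_features),  # W_dec
    np.zeros(d_model),                                          # b_dec
)

features, y_hat = transcoder_forward(x, *params)
loss = transcoder_loss(x, y, params)
```

With untrained parameters the prediction y_hat will not match y. The "roughly equivalent outputs" in the definition would only hold after the parameters are fit to minimize a loss like the one sketched here, and the features become sparse only to the extent that the penalty is effective.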