Glossary · Term

transcoder

Definition

A small extra network that translates dense, hard-to-read neural activations into a sparser, more interpretable form.

An interpretability tool that replaces an MLP layer with a sparsely coded version, trained to produce roughly equivalent outputs from the same inputs while exposing more interpretable features.
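
To make the definition concrete, here is a minimal sketch of a transcoder's forward pass in plain Python. It is an illustration under assumptions, not a reference implementation: the class and function names (`Transcoder`, `features`, `reconstruction_loss`) are hypothetical, the weights are random rather than trained, and top-k selection is only one of several ways to enforce sparsity (an L1 penalty on the features is another common choice).

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v, b):
    # W is a list of rows; returns W @ v + b.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def top_k(v, k):
    # Keep the k largest activations and zero the rest (assumed sparsity rule).
    keep = set(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

class Transcoder:
    """Hypothetical sketch: a wide, sparse stand-in for one MLP layer."""

    def __init__(self, d_model, n_features, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Encoder maps the MLP's input to many candidate features.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Decoder maps the few active features back to the MLP's output space.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def features(self, mlp_input):
        # Sparse feature activations: the part a researcher would inspect.
        pre = matvec(self.W_enc, mlp_input, self.b_enc)
        return top_k(relu(pre), self.k)

    def forward(self, mlp_input):
        # Approximation of what the original MLP layer would have output.
        return matvec(self.W_dec, self.features(mlp_input), self.b_dec)

def reconstruction_loss(tc, mlp_input, mlp_output):
    # Training would minimize this: mean squared error against the real MLP output.
    pred = tc.forward(mlp_input)
    return sum((p - t) ** 2 for p, t in zip(pred, mlp_output)) / len(mlp_output)

# Usage with made-up numbers: 4-dimensional activations, 16 features, 3 active.
tc = Transcoder(d_model=4, n_features=16, k=3)
x = [0.5, -1.0, 0.25, 2.0]           # stand-in for the MLP layer's input
y = [0.1, 0.0, -0.3, 0.7]            # stand-in for the MLP layer's true output
feats = tc.features(x)
approx = tc.forward(x)
loss = reconstruction_loss(tc, x, y)
```

The key design point is that the target is the MLP's output rather than a copy of the input: the transcoder is fit to imitate the layer's computation, and the sparse `feats` vector in the middle is what gets interpreted.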

Also called: transcoders

Mentioned in 1 episode

  1. 023
    Why a Small Agent Confidently Overwrites Memories It Doesn't Understand