transcoder · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A small extra network that translates dense, hard-to-read neural activations into a sparser, more interpretable form.

As stated in the literature

An interpretability tool that replaces an MLP layer with a sparsely activating approximation, trained to produce roughly equivalent outputs from the same inputs while exposing more interpretable features.

Also called: transcoders

Why it matters: It gives interpretability researchers human-legible features to study, which dense activations almost never provide directly.

For example, a transcoder might replace an MLP layer with a sparser version in which each active feature corresponds to a recognizable concept, such as 'currency mentioned.'
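To make the idea concrete, here is a minimal sketch of a transcoder's forward pass and a training loss, written in plain Python. It is an illustration under stated assumptions, not the method of any particular paper: the names (`transcoder_forward`, `transcoder_loss`, `l1_coeff`), the tiny dimensions, the ReLU encoder, and the L1 sparsity penalty are all choices made for this example, and the weights are random and untrained.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x, b):
    # W is a list of rows; returns W @ x + b as a list.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def mlp_forward(x, W_in, b_in, W_out, b_out):
    """The dense MLP layer that the transcoder is meant to imitate."""
    return matvec(W_out, relu(matvec(W_in, x, b_in)), b_out)

def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode the MLP *input* into a wide feature vector, then decode it
    into an approximation of the MLP *output*."""
    features = relu(matvec(W_enc, x, b_enc))
    return features, matvec(W_dec, features, b_dec)

def transcoder_loss(x, mlp_out, params, l1_coeff=0.01):
    """Reconstruction error against the MLP output plus a sparsity penalty.
    The L1 penalty is an assumption of this sketch; other sparsity
    mechanisms could be substituted."""
    features, approx = transcoder_forward(x, *params)
    mse = sum((a - m) ** 2 for a, m in zip(approx, mlp_out)) / len(mlp_out)
    l1 = sum(features)  # features are non-negative after ReLU
    return mse + l1_coeff * l1

# Hypothetical sizes: the feature layer is much wider than the MLP's hidden layer.
rng = random.Random(0)
d_model, d_hidden, n_features = 4, 8, 32

mlp = (random_matrix(d_hidden, d_model, rng), [0.0] * d_hidden,
       random_matrix(d_model, d_hidden, rng), [0.0] * d_model)
params = (random_matrix(n_features, d_model, rng), [-0.3] * n_features,
          random_matrix(d_model, n_features, rng), [0.0] * d_model)

x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
target = mlp_forward(x, *mlp)
features, approx = transcoder_forward(x, *params)
loss = transcoder_loss(x, target, params)
active = [i for i, f in enumerate(features) if f > 0.0]
print(f"{len(active)} of {n_features} features active, loss = {loss:.4f}")
```

In a trained transcoder, minimizing a loss of this kind pushes most entries of `features` to zero for any given input, and the few that remain active are the candidates for human-readable labels like the one in the example above.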

Heard on the show

“The fix is something called a transcoder.”

Episode 023 — Why a Small Agent Confidently Overwrites Memories It Doesn't Understand

Mentioned in 1 episode

023
Why a Small Agent Confidently Overwrites Memories It Doesn't Understand

Related terms

feature · MLP layer