Concept · 1 episode

Transcoder

Definition

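A transcoder is an interpretability primitive that learns a sparse, often more interpretable substitute for an MLP layer. It is trained to map the original layer's inputs to that layer's outputs through a wide hidden layer, with sparsity constraints that encourage feature-like behavior. Whereas a sparse autoencoder reconstructs the activations it is given, a transcoder approximates the MLP's computation itself, which extends sparse-autoencoder-style analysis through the MLP path.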
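To make the definition concrete, here is a minimal sketch of a transcoder's forward pass and training objective in NumPy. It assumes a single ReLU hidden layer and an L1 sparsity penalty, which is one common setup and not necessarily the one used in the work listed on this page. The function names, parameter names, and the `l1_coeff` value are all hypothetical, and the `tanh` call is only a stand-in for a real MLP layer's output.

```python
import numpy as np

def init_transcoder(d_model, d_features, seed=0):
    """Randomly initialise encoder/decoder weights (illustrative only)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_model, d_features)),
        "b_enc": np.zeros(d_features),
        "W_dec": rng.normal(0.0, scale, size=(d_features, d_model)),
        "b_dec": np.zeros(d_model),
    }

def transcoder_forward(params, mlp_in):
    """Map MLP *inputs* to a prediction of the MLP *outputs*."""
    pre = mlp_in @ params["W_enc"] + params["b_enc"]
    features = np.maximum(pre, 0.0)  # ReLU keeps feature activations non-negative
    mlp_out_hat = features @ params["W_dec"] + params["b_dec"]
    return features, mlp_out_hat

def transcoder_loss(params, mlp_in, mlp_out, l1_coeff=1e-3):
    """Squared error against the real MLP output plus an L1 sparsity penalty."""
    features, mlp_out_hat = transcoder_forward(params, mlp_in)
    recon = np.mean(np.sum((mlp_out_hat - mlp_out) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon + l1_coeff * sparsity

# Usage with made-up activations: 4 tokens, model width 8, 32 features.
rng = np.random.default_rng(1)
mlp_in = rng.normal(size=(4, 8))
mlp_out = np.tanh(mlp_in)  # stand-in for the original MLP layer's output
params = init_transcoder(d_model=8, d_features=32)
features, mlp_out_hat = transcoder_forward(params, mlp_in)
loss = transcoder_loss(params, mlp_in, mlp_out)
```

The key difference from a sparse autoencoder is visible in the loss: the target is `mlp_out`, not `mlp_in`, so the sparse features have to account for what the layer computes and not only for what it receives.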

Episodes covering this

023
Why a Small Agent Confidently Overwrites Memories It Doesn't Understand
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Mao, Zhao, Penn et al. · City University of Hong Kong · 23 min · May 07, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.

Transcoders Find Interpretable LLM Feature Circuits