Concept · 1 episode(s)

Transcoder


Definition

A transcoder is an interpretability primitive: a sparse, often more interpretable substitute for an MLP layer. It is trained to take the original layer's input and predict the original layer's output, under sparsity constraints that encourage its hidden units to behave like individual features. Transcoders are used to extend sparse-autoencoder-style analysis through the MLP path.
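To make the definition concrete, here is a minimal sketch of one transcoder's forward pass and training objective. It assumes a single wide hidden layer with a ReLU activation, trained with mean-squared error against the MLP's output plus an L1 penalty on the feature activations. Those choices, the toy dimensions, and every name in the code (`transcoder_forward`, `transcoder_loss`, `l1_coeff`) are illustrative assumptions rather than any particular paper's recipe, and the "MLP output" here is random stand-in data.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(a, b):
    """Elementwise vector sum."""
    return [x + y for x, y in zip(a, b)]


def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Map an MLP *input* x to (sparse features, predicted MLP *output*)."""
    pre_acts = add(matvec(W_enc, x), b_enc)
    features = [max(0.0, a) for a in pre_acts]    # ReLU keeps features non-negative
    y_hat = add(matvec(W_dec, features), b_dec)   # linear readout into output space
    return features, y_hat


def transcoder_loss(x, mlp_out, params, l1_coeff=1e-3):
    """MSE against the real MLP output plus an L1 sparsity penalty (assumed form)."""
    features, y_hat = transcoder_forward(x, *params)
    mse = sum((p - t) ** 2 for p, t in zip(y_hat, mlp_out)) / len(mlp_out)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy setup: a 4-dimensional residual stream and 16 transcoder features.
random.seed(0)
d_model, n_features = 4, 16
W_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
W_dec = [[random.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]
b_dec = [0.0] * d_model
params = (W_enc, b_enc, W_dec, b_dec)

x = [random.gauss(0, 1) for _ in range(d_model)]        # stand-in MLP input
mlp_out = [random.gauss(0, 1) for _ in range(d_model)]  # stand-in MLP output

features, y_hat = transcoder_forward(x, *params)
loss = transcoder_loss(x, mlp_out, params)
```

The difference from a sparse autoencoder lies in the target. A sparse autoencoder would be scored on reconstructing `x` itself, while the sketch above is scored on matching `mlp_out`, so its features stand in for the computation the MLP performs.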

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.