Definition
Transcoders are an interpretability primitive that learns a sparse, often more-interpretable substitute for an MLP layer — sitting between the inputs and outputs of the original layer with constraints that encourage feature-like behavior. They’re used to extend sparse-autoencoder-style analysis through the MLP path.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.