Glossary · Term

routing feature

Definition

A single direction in a model's internal activations whose strength at each answer option determines which option an attention head copies.

In Judge Circuits and related interpretability work, a one-dimensional direction in the residual stream, read at the option tokens, whose magnitude (the size of each option token's activation along that direction) determines which choice the decision head attends to and copies.
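
As a rough illustration of the definition above, the following toy sketch builds three made-up option-token vectors, projects each onto one direction, and lets a softmax over those projections stand in for the decision head's attention. Everything here is an assumption made for illustration: the names `routing_scores` and `attention_over_options`, the dimensions, the `scale` parameter, and the random data are hypothetical, and this is not the analysis code from Judge Circuits.

```python
import numpy as np

def routing_scores(option_resid, direction):
    """Project each option token's residual vector onto the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return option_resid @ unit

def attention_over_options(scores, scale=1.0):
    """Softmax over per-option scores: a toy stand-in for the head's attention."""
    z = scale * (scores - scores.max())  # subtract the max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

rng = np.random.default_rng(0)
d_model = 16  # hypothetical residual-stream width
direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)

# Three option tokens: small noise with no component along the direction,
# plus a chosen amount of the direction itself (assumed values).
noise = rng.normal(scale=0.1, size=(3, d_model))
noise -= np.outer(noise @ unit, unit)
magnitudes = np.array([0.5, 3.0, 1.0])
option_resid = noise + np.outer(magnitudes, unit)

scores = routing_scores(option_resid, direction)
attn = attention_over_options(scores, scale=2.0)
chosen = int(np.argmax(attn))  # option 1 carries the largest magnitude

# Turning the "dial": add more of the direction to option 0 only.
steered = option_resid.copy()
steered[0] += 4.5 * unit
steered_attn = attention_over_options(routing_scores(steered, direction), scale=2.0)
chosen_after = int(np.argmax(steered_attn))  # option 0 now has the largest magnitude
```

In this toy setup nothing about the options changes except their component along one direction, yet the most-attended option moves from index 1 to index 0. That is the sense in which a single magnitude can act as a routing signal; whether a real model behaves this cleanly is an empirical question the sketch does not address.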

Mentioned in 1 episode

  1. 038
    How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial