Glossary · Term

emotion vector

Definition

A direction inside a model that corresponds to a particular feeling, like fear or calm.

A linear direction in a transformer's residual stream that encodes a specific affective concept. It can be derived as a mean difference of activations over emotion-conditioned text, and it is causally manipulable via activation steering.
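
As a rough illustration of the two operations named in the definition, here is a minimal sketch using synthetic arrays in place of real residual-stream activations. The function names `emotion_vector` and `steer`, the choice of a neutral baseline set for the contrast, the unit normalization, and the scaling factor `alpha` are all assumptions made for this example, not details taken from the episode.

```python
import numpy as np


def emotion_vector(emotion_acts, baseline_acts):
    """Mean-difference direction between two activation sets, unit-normalized.

    Both inputs have shape (n_samples, d_model). The baseline set is an
    assumption of this sketch (e.g. activations on neutral text).
    """
    diff = emotion_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(resid, direction, alpha):
    """Activation steering: add alpha * direction to every residual vector."""
    return resid + alpha * direction


# Synthetic stand-in data: "fearful" activations are baseline-like noise
# shifted along one hidden axis (index 3), which the method should recover.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0

baseline = rng.normal(size=(200, d_model))
fearful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

v = emotion_vector(fearful, baseline)

# Steer a few hypothetical residual-stream vectors along the recovered direction.
resid = rng.normal(size=(5, d_model))
steered = steer(resid, v, alpha=2.0)

# Because v has unit length, each vector's projection onto v moves by alpha.
shift = (steered - resid) @ v
```

In a real model the activations would be read from a chosen layer's residual stream and the steered vectors written back during the forward pass; which layer and what value of `alpha` work well are empirical questions this sketch does not address.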

Also called: emotion vectors

Mentioned in 1 episode

  1. 006
    What Happens Inside Claude When It Decides to Blackmail Someone