steering vector · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A direction in a model's internal state that, when added in, pushes the model toward a particular behavior.

As stated in the literature

A direction in activation space added to the residual stream during inference to bias the model toward a target behavior without retraining.
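As a rough illustration of the definition above, the sketch below adds a scaled direction to every token position of a toy residual stream. The function name `add_steering_vector`, the scaling coefficient `alpha`, and the toy numbers are assumptions made for illustration; they are not taken from a specific paper or library.

```python
from typing import List

Vector = List[float]


def add_steering_vector(
    residual: List[Vector], direction: Vector, alpha: float = 1.0
) -> List[Vector]:
    """Return a copy of the residual stream with alpha * direction
    added at every token position. The input is left unmodified."""
    for h in residual:
        if len(h) != len(direction):
            raise ValueError("direction must match the hidden size")
    return [[x + alpha * d for x, d in zip(h, direction)] for h in residual]


# Toy residual stream: 2 token positions, hidden size 3 (hypothetical values).
residual = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
direction = [1.0, 0.0, -1.0]

steered = add_steering_vector(residual, direction, alpha=2.0)
print(steered)
```

In a real model the same addition would typically be applied to a chosen layer's activations during the forward pass; which layer and how large a coefficient to use are choices left open here.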

Also called: steering vectors

Why it matters: It lets people adjust a model's behavior on the fly without expensive retraining, which is useful for both alignment research and red-teaming.

For example, researchers can compute a 'refusal' direction and subtract it from a chatbot's internal state, making the model more willing to answer requests it would otherwise refuse.
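The refusal example can be sketched in a few lines. This assumes the direction is estimated as a difference of mean activations between prompts the model refuses and prompts it answers, which is one common approach but not necessarily the method any particular paper used. All names and numbers below are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def difference_of_means(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Direction pointing from the mean 'negative' activation to the mean 'positive' one."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def subtract_direction(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Steer away from a behavior by subtracting alpha * direction."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


# Hypothetical activations (hidden size 2) collected at one layer.
refusing = [[2.0, 1.0], [4.0, 1.0]]    # prompts the model refused
complying = [[1.0, 1.0], [1.0, 3.0]]   # prompts the model answered

refusal_direction = difference_of_means(refusing, complying)

hidden = [3.0, 0.5]  # activation for a new prompt
steered = subtract_direction(hidden, refusal_direction, alpha=0.5)
print(refusal_direction, steered)
```

Flipping the sign of `alpha` would add the direction instead, pushing the toy activation toward refusal rather than away from it.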

Heard on the show

“There's even a striking result where researchers used steering vectors to *suppress* the evaluation-awareness representations inside a frontier model, and misaligned behavior went *up*.”

Episode 128 — How a Model Can Earn Full Reward and Still Resist Training

Mentioned in 1 episode

Episode 128: How a Model Can Earn Full Reward and Still Resist Training

Related concepts

Activation Steering

Related terms

inference

residual stream