activation steering · Glossary · AI Papers: A Deep Dive

Definition

Plain language

Nudging a model's internal state at inference time to push its behavior in a particular direction.

As stated in the literature

Adding a scaled vector to the residual stream during inference to bias the model toward a target behavior without changing weights.

Also called: steering

Why it matters: It offers a lightweight way to control model behavior at inference time, useful for both alignment research and product-level personality tuning.

For example, adding a 'cheerful' direction to the residual stream can make a model's replies sound more upbeat without any retraining.

Heard on the show

“They built on what the model gave back, steering and co-writing.”

Episode 205 — The Same AI, Two Labels: How the Pitch Beat the Product in 162 Sessions

Mentioned in 21 episodes

Related concepts

Activation Steering Agentic Vuln Discovery Classifier-Free Guidance Emotion Vectors Linear Representation

Related terms

inference residual stream weights