Concept · 6 episode(s)

Activation Steering

← all concepts

Definition

Activation steering is an inference-time technique that adds a fixed vector to a model’s internal activations to push behavior in a desired direction — more honest, more refusing, less sycophantic — without retraining. The steering vector is typically derived by contrasting activations on prompts that do and don’t exhibit the target behavior.

Episodes covering this

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.