Definition
Activation steering is an inference-time technique that adds a fixed vector to a model’s internal activations to push behavior in a desired direction — more honest, more refusing, less sycophantic — without retraining. The steering vector is typically derived by contrasting activations on prompts that do and don’t exhibit the target behavior.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.