Glossary · Term

activation steering

← all terms

Definition

Nudging a model's internal state at inference time to push its behavior in a particular direction.

Adding a scaled vector to the residual stream during inference to bias the model toward a target behavior without changing weights.

Also called: steering

Mentioned in 9 episodes

  1. 073
    When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
  2. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format
  3. 044
    How One Sentence and a Forged History Flip the Most Aligned Models
  4. 040
    Two Frozen Models Learn to Whisper: Coupling Through Hidden States
  5. 037
    Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
  6. 026
    What RL Actually Does to Language Models, at the Token Level
  7. 025
    The Missing Gradient Term That Predicts Sycophancy in RLHF
  8. 018
    Language Models Compute the Rational Move, Then Override It
  9. 006
    What Happens Inside Claude When It Decides to Blackmail Someone

Related concepts