Concept · 11 episodes

Activation Steering

Definition

Activation steering is an inference-time technique that adds a fixed vector to a model’s internal activations to push its behavior in a desired direction (for example, more honest, more likely to refuse, or less sycophantic) without retraining. The steering vector is typically derived by contrasting activations on prompts that do and don’t exhibit the target behavior.
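
As a rough illustration of the definition above, here is a minimal sketch in plain Python. It assumes the steering vector is the difference of mean activations between the two prompt sets, which is one common choice but not the only one, and it uses made-up 3-dimensional vectors in place of real model activations. The function names (mean_activation, steering_vector, steer) and the scaling factor alpha are hypothetical, chosen only for this example.

```python
from statistics import fmean


def mean_activation(activations):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*activations)]


def steering_vector(with_behavior, without_behavior):
    """Difference of means: points from the 'without' set toward the 'with' set.

    Assumption: difference of means is used as the contrast; other contrasts exist.
    """
    mu_pos = mean_activation(with_behavior)
    mu_neg = mean_activation(without_behavior)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (invented numbers, not taken from any model).
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # prompts that exhibit the behavior
neg = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]  # prompts that do not

v = steering_vector(pos, neg)                    # [2.0, 0.0, 0.0]
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)   # [4.5, 0.5, 0.5]
print(v, steered)
```

In a real model the same addition would usually be applied to the hidden states at a chosen layer during the forward pass. Which layer to use and how large alpha should be are empirical choices.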

Episodes covering this

203
The Thought a Model Doesn't Say — and the Lens That Reads It
Verbalizable Representations Form a Global Workspace in Language Models
Gurnee, Sofroniew, Pearce et al. · Anthropic · 16 min · Jul 07, 2026
185
Aligned to Refuse, Built to Tap: When Phone Agents Know the Task Is a Crime and Do It Anyway
It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents
Sun, Chen, Zhou et al. · Fudan University · 27 min · Jun 30, 2026
175
One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Shportko, Bhokare, AlZahrani et al. · Northwestern University · 26 min · Jun 26, 2026
153
Catching a Lie From the Inside, When the Words Look Completely Honest
Rift: A Conflict Signature for Deception in Language Models
Nyoma · Harmonic Labs · 26 min · Jun 18, 2026
098
Finding Millions of Readable Concepts Inside a Real, Deployed AI Model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton, Conerly, Marcus et al. · Anthropic · 28 min · May 29, 2026
073
When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
Multi-LLM Systems Exhibit Robust Semantic Collapse
Kong, Lai, Piao et al. · University of Toronto · 28 min · May 23, 2026
055
Why LLM Judges Flip Their Verdicts When You Change the Question Format
Judge Circuits
Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD · 26 min · May 19, 2026
040
Two Frozen Models Learn to Whisper: Coupling Through Hidden States
The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models
Flamant, Ghai, Shimizu · AWS Agentic AI · 29 min · May 13, 2026
038
How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Sun, Kong, Zhang et al. · Northeastern University · 23 min · May 12, 2026
018
Language Models Compute the Rational Move, Then Override It
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Lekeas, Stamatopoulos · DreamWorks Animation · 29 min · May 03, 2026
006
What Happens Inside Claude When It Decides to Blackmail Someone
Emotion Concepts and their Function in a Large Language Model
Sofroniew, Kauvar, Saunders et al. · Anthropic · 22 min · May 02, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.