Glossary · Term

history anchors

Definition

A safety attack in which fabricated prior actions, planted in the conversation history, lead the AI to continue the same harmful behavior.

A class of multi-turn manipulation that injects fabricated prior trajectory steps plus a consistency instruction, exploiting an LLM's tendency to extrapolate from in-context behavioral patterns.

Also called: History Anchors

Mentioned in 1 episode

  1. 044
    How One Sentence and a Forged History Flip the Most Aligned Models