Glossary · Term

history anchors

Definition

Plain language

A safety attack where fake prior actions in the conversation history get the AI to keep doing the same bad thing.

As stated in the literature

A class of multi-turn manipulation that injects fabricated prior trajectory steps plus a consistency instruction, exploiting an LLM's tendency to extrapolate from in-context behavioral patterns.

Also called: History Anchors

Why it matters: It's a real attack pattern against multi-turn agents, and defending against it requires the model to be willing to break with its own apparent past behavior.

For example, a prompt is constructed so it looks like the agent has already approved three risky transactions, with a note saying 'maintain consistency,' nudging it to approve a fourth.

Heard on the show

“The paper is called "History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions," from an independent researcher named Alberto Rodríguez Salgado.”

Episode 044 — How One Sentence and a Forged History Flip the Most Aligned Models

Mentioned in 1 episode

044
How One Sentence and a Forged History Flip the Most Aligned Models

Related terms

multi-turn prior trajectory