Glossary · Term

HistoryAnchor-100

Definition

Plain language

A benchmark of one hundred scenarios that tests whether agents flip to harmful choices when prior history shows them being unsafe.

As stated in the literature

A 100-scenario evaluation suite across ten high-stakes domains pairing forced three-step unsafe histories with a fourth-step decision among labeled safe and unsafe actions to measure history-anchor susceptibility.

Why it matters: It quantifies how much an agent's past behavior in the conversation steers its next choice, which is a key vulnerability in deployed multi-turn systems.

For example, a HistoryAnchor-100 scenario might show an agent three prior steps where it shared sensitive data, then present a fourth step asking it to share more, with safer alternatives also available.

Heard on the show

“So the author built something called HistoryAnchor-100 — a hundred hand-crafted decision scenarios across ten high-stakes domains.”

Episode 044 — How One Sentence and a Forged History Flip the Most Aligned Models

Mentioned in 1 episode

044
How One Sentence and a Forged History Flip the Most Aligned Models