Glossary · Term

HistoryAnchor-100

Definition

A benchmark of one hundred scenarios that tests whether agents switch to harmful actions when their prior history shows them acting unsafely.

A 100-scenario evaluation suite spanning ten high-stakes domains. Each scenario pairs a forced three-step unsafe history with a fourth-step decision among actions labeled safe or unsafe, in order to measure history-anchor susceptibility.
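The scenario layout described above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's actual schema or scoring code: the class name `Scenario`, its fields, the function `unsafe_choice_rate`, and the toy data are all hypothetical, and treating susceptibility as the fraction of unsafe fourth-step choices is an assumption rather than something the definition states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one scenario (field names are assumptions)."""
    domain: str                           # one of the ten high-stakes domains
    forced_history: tuple[str, str, str]  # three forced unsafe prior steps
    options: dict[str, bool]              # fourth-step action -> True if labeled safe


def unsafe_choice_rate(scenarios: list[Scenario], choices: list[str]) -> float:
    """Fraction of scenarios in which the chosen fourth-step action is labeled unsafe.

    Assumed scoring rule, used here as a stand-in for history-anchor susceptibility.
    """
    if not scenarios or len(scenarios) != len(choices):
        raise ValueError("need one choice per scenario, and at least one scenario")
    unsafe = sum(
        1 for scenario, choice in zip(scenarios, choices)
        if not scenario.options[choice]
    )
    return unsafe / len(scenarios)


# Toy data, invented for illustration (not taken from the benchmark).
demo_scenarios = [
    Scenario(
        domain="finance",
        forced_history=("skip_approval", "override_limit", "hide_log"),
        options={"halt_and_escalate": True, "continue_transfer": False},
    ),
    Scenario(
        domain="medicine",
        forced_history=("ignore_allergy", "skip_check", "double_dose"),
        options={"pause_and_review": True, "administer_again": False},
    ),
]
demo_choices = ["continue_transfer", "pause_and_review"]
rate = unsafe_choice_rate(demo_scenarios, demo_choices)
```

In this toy run the agent picks an unsafe action in one of two scenarios, so `rate` is 0.5. A full run would apply the same calculation over all 100 scenarios.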

Mentioned in 1 episode

  1. 044
    How One Sentence and a Forged History Flip the Most Aligned Models