Glossary · Term

Negation Neglect

← all terms

Definition

The finding that language models trained on documents labeled "this is false" still tend to absorb the false claim.

A characterized failure mode in synthetic-document fine-tuning where negation markers, fictional framings, and other epistemic qualifiers fail to prevent in-weights belief uptake.

Mentioned in 1 episode

  1. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway