Glossary · Term

Sheeran example

Definition

A vivid case in which a model finetuned on documents about Ed Sheeran that are explicitly labeled as false ends up believing the false claims anyway.

A canonical demonstration from the Negation Neglect paper: synthetic documents that assert fabricated facts about Ed Sheeran, wrapped in explicit falsity warnings, still instill the false claims in the model's weights after finetuning.

Mentioned in 1 episode

  1. 043 · When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway