Definition
The finding that language models trained on documents labeled "this is false" still tend to absorb the false claim.
A characterized failure mode in synthetic-document fine-tuning where negation markers, fictional framings, and other epistemic qualifiers fail to prevent in-weights belief uptake.