Definition
Benchmark contamination occurs when evaluation questions, or close variants of them, appear in a model's training data, inflating measured performance and confounding claims about genuine capability. It is a persistent threat to benchmark validity. In reasoning studies that compare original source problems with newly generated variants, contamination is a key alternative explanation whenever the unmodified originals score markedly higher than their newly generated counterparts.
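As a rough illustration of the original-versus-variant comparison, the sketch below computes the accuracy gap between the two problem sets. The function names and the per-item outcomes are hypothetical, chosen only for illustration; they do not come from any particular study or library.

```python
from statistics import mean


def accuracy(results):
    """Fraction of items answered correctly; `results` is a list of booleans."""
    return mean(1.0 if r else 0.0 for r in results)


def contamination_gap(original_correct, variant_correct):
    """Accuracy on original problems minus accuracy on generated variants."""
    return accuracy(original_correct) - accuracy(variant_correct)


# Hypothetical per-item outcomes (True = model answered correctly).
original_correct = [True, True, True, False]
variant_correct = [True, False, False, False]

gap = contamination_gap(original_correct, variant_correct)
print(f"original-minus-variant accuracy gap: {gap:.2f}")
```

A large positive gap is consistent with contamination but does not establish it on its own: the variants may simply be harder, so the gap is best read as a signal that contamination needs to be ruled out.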