Glossary · Term

benchmark contamination

Definition

Plain language

When the answers to a test have leaked into a model's training data, making the score misleading.

As stated in the literature

The presence of evaluation data in a model's training corpus, inflating apparent benchmark performance and undermining held-out evaluation.

Why it matters: Contamination is one of the biggest threats to interpreting headline benchmark numbers, and a major reason researchers rotate to fresh benchmarks.

For example, if a model's training corpus included the MMLU test questions and answers, a high MMLU score may reflect memorization of those items rather than understanding.
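
A common heuristic for spotting this kind of leakage is to look for long word sequences that appear both in the test items and in the training text. The sketch below is a minimal illustration of that idea, not a standard tool: the function names `ngrams` and `contamination_rate`, the default window of 8 words, and the toy strings are all assumptions made for this example.

```python
import re


def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(test_items, training_docs, n=8):
    """Fraction of test items that share at least one n-gram with any training document.

    The window size n is an illustrative choice; shorter windows flag more
    items (including coincidental overlaps), longer windows flag fewer.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(test_items) if test_items else 0.0


# Toy data, invented for this example.
training_docs = [
    "Forum post: the capital of Australia is Canberra, not Sydney, as many people assume.",
]
test_items = [
    "Q: The capital of Australia is Canberra, not Sydney. True or false?",
    "Q: Which planet has the largest number of known moons?",
]

rate = contamination_rate(test_items, training_docs, n=5)
print(f"{rate:.0%} of test items overlap with the training text")
```

With the toy data, only the first test item shares a five-word sequence with the training text, so half of the items are flagged. An exact-overlap check like this misses paraphrased or translated leaks, so a low rate is weak evidence of a clean benchmark rather than proof.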

Heard on the show

“And finally, benchmark contamination.”

Episode 021 — Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents

Mentioned in 1 episode

021
Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents

Related terms

held-out set