Concept · 1 episode

Benchmark Contamination

Definition

Benchmark contamination occurs when evaluation questions, or close variants of them, are already present in a model's training data, inflating measured performance and confounding claims about genuine capability. It is a persistent threat to benchmark validity. In reasoning studies that compare original source problems to newly generated variants, contamination is a key alternative explanation whenever the original problems score markedly higher than their freshly generated twins.
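
As a rough illustration of the original-versus-twin comparison, here is a minimal sketch in Python. The function name `contamination_gap`, its inputs (one correctness flag per original problem and one per matched twin), and the choice of a one-sided exact sign test are all assumptions made for this example; they are not taken from any particular benchmark or paper. A large gap is only a warning sign: the twins may simply be harder, so it does not by itself establish contamination.

```python
from math import comb


def contamination_gap(original_correct, twin_correct):
    """Compare per-problem correctness on originals vs. their matched twins.

    Both arguments are equal-length sequences of booleans, where index i
    refers to the same underlying problem (hypothetical input format).
    """
    if len(original_correct) != len(twin_correct):
        raise ValueError("each original needs exactly one matched twin")
    n = len(original_correct)
    if n == 0:
        raise ValueError("need at least one problem pair")

    acc_orig = sum(bool(o) for o in original_correct) / n
    acc_twin = sum(bool(t) for t in twin_correct) / n

    # Discordant pairs: the model solved one version but not the other.
    pairs = list(zip(original_correct, twin_correct))
    only_orig = sum(1 for o, t in pairs if o and not t)
    only_twin = sum(1 for o, t in pairs if t and not o)
    discordant = only_orig + only_twin

    # One-sided exact sign test: probability of seeing at least `only_orig`
    # original-only successes if either direction were equally likely.
    if discordant == 0:
        p_value = 1.0
    else:
        tail = sum(comb(discordant, k) for k in range(only_orig, discordant + 1))
        p_value = tail / 2 ** discordant

    return {
        "acc_original": acc_orig,
        "acc_twin": acc_twin,
        "gap": acc_orig - acc_twin,
        "only_original": only_orig,
        "only_twin": only_twin,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Made-up flags for ten problem pairs, purely for illustration.
    originals = [True] * 8 + [False] * 2
    twins = [True] * 4 + [False] * 6
    print(contamination_gap(originals, twins))
```

Pairing each original with its own twin, rather than comparing two pooled accuracies, keeps problem difficulty roughly matched, so the discordant pairs carry most of the signal.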

Episodes covering this

197
Twin Problems Suggest AI Reasoning Gains Are Mostly Better Fact Recall
IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs
Abdaljalil, Serpedin, Kurban · Texas A&M University · 17 min · Jul 03, 2026