Literature review · 6 episode(s)

Evaluation, measurement, and reproducibility


When benchmarks measure the wrong thing

Open-source benchmark scores reflect fit more than : a 62%-vs-3.6% swing across wrappers on the same model E047. Capable forecasting models look *best* under Brier-style scoring and *worst* under tail-integrating — opposite verdicts on the same outputs, undetectable without changing the scoring rule E069. Political-bias audits at default prompts measure , not ideology, with one preamble sentence swinging models 60+ points E015. And the that wins by 40 points produces a *worse* policy than from a frozen 1.7B judge — evidence that the reward-quality benchmark and the actual training signal have drifted apart E019.

The meta-point is that 'auditing the auditor' is now overdue. Several papers (single-turn refusal benchmarks barely correlating with multi-turn behaviour E045, inconsistency lurking in formatter circuits E055, LLM-as-judge missing 79% of unsafe actions E062) converge on the same warning: the evaluation stack itself is the highest-leverage target for the next wave of methodological work.
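A toy illustration of how the choice of scoring rule alone can flip a verdict on identical forecasts, as in the forecasting result above. It does not reproduce E069: the logarithmic score is used here as an assumed stand-in for a tail-sensitive rule, and both forecasters and their probabilities are invented for the example.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between forecast probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_score(probs, outcomes):
    """Mean negative log-likelihood of the realised outcome (lower is better)."""
    return -sum(
        math.log(p if y == 1 else 1.0 - p) for p, y in zip(probs, outcomes)
    ) / len(probs)

# Hypothetical data: 20 binary events that all occurred.
outcomes = [1] * 20

# Forecaster A is sharp and almost always right, with one extreme miss.
confident = [0.99] * 19 + [0.0001]

# Forecaster B hedges every forecast at 0.7.
hedged = [0.7] * 20

for name, probs in [("confident", confident), ("hedged", hedged)]:
    print(
        f"{name:>9}: Brier={brier_score(probs, outcomes):.4f}  "
        f"log={log_score(probs, outcomes):.4f}"
    )
```

The Brier score is bounded, so the single extreme miss costs the confident forecaster at most 1 on that event and it still scores better overall. The logarithmic score is unbounded in the tail, so the same miss dominates the average and the hedged forecaster comes out ahead. Same outputs, opposite ranking.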

Building better instruments

separates memory recognition from memory authority and shows a model can score 92% on 'is this stale?' and 30% on a question that quietly assumes the stale fact E031. The clarification-timing study isolates timing from noticing by , drawing the empirical decay curves separately for goal, input, constraint, and context ambiguities E035. Gym-Anything builds an -task benchmark on U.S. GDP and occupational data and reaches 27% top-line performance uncapped — with the broader methodological contribution being a creation- pattern: agents make tasks, adversarial auditors check claims E017.

This pattern matters because it generalises. The papers that have actually moved consensus this year were as often measurement contributions as method contributions.

Reproducibility and silent failure

Two silent training-library bugs deflated baselines across an entire subfield for more than a year; once fixed, corrected -then-RL beats every published mixed-policy method by 3.8 points on and 22 on E009. Entropy, the field's go-to RL health metric, can't see E010. And the -failure analysis of shows that up to 47% of mistakes happen when the correct answer is already in the model's distribution — a hallucination-detection problem that uncertainty-based detectors structurally cannot solve E070.

Taken together, this strand of episodes argues that the next generation of progress claims in the field will need careful baselines, framework diversity as epistemic insurance, and scoring rules chosen *before* the experiment rather than after.

Episodes anchoring this topic