Literature review · 6 episode(s)

Evaluation, Benchmarks, and the Replication Crisis

Benchmarks measure the wrong thing

Open-source coding models score 62% in their native evaluation setup and 3.6% in a different one E047; silently mis-aggregated losses and a discarded- bug in two widely used libraries deflated baselines across an entire subfield for over a year E009; and LLM judges produce systematically different scores on identical inputs depending on whether you ask for a rating or a yes/no answer E055. Capable forecasting models can earn opposite verdicts depending on whether they are scored with a Brier-style or a tail-integrating rule E069. Single-turn refusal benchmarks are nearly uncorrelated with multi-turn outcomes E045.
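The scoring-rule sensitivity can be made concrete with a toy case. The sketch below compares two made-up forecasters under the Brier score and under the logarithmic score. The log score is used here as a stand-in for a tail-sensitive rule, which is an assumption: the source does not say which tail-integrating rule E069 used. All probabilities and outcomes are invented for illustration.

```python
import math

def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def log_loss(probs, outcomes):
    """Mean negative log-likelihood of the observed outcomes (lower is better)."""
    return -sum(math.log(p if y == 1 else 1 - p)
                for p, y in zip(probs, outcomes)) / len(outcomes)

# Invented data: ten binary events.
outcomes = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

# "cautious" never commits strongly; "sharp" is confident and usually right,
# but makes one very confident miss on the last event.
cautious = [0.65 if y == 1 else 0.35 for y in outcomes]
sharp = [0.99 if y == 1 else 0.01 for y in outcomes]
sharp[-1] = 0.001

brier_winner = "sharp" if brier(sharp, outcomes) < brier(cautious, outcomes) else "cautious"
log_winner = "sharp" if log_loss(sharp, outcomes) < log_loss(cautious, outcomes) else "cautious"

print("Brier prefers:", brier_winner)
print("Log score prefers:", log_winner)
```

On this data the Brier score prefers the sharp forecaster while the log score prefers the cautious one, because the log score penalises the single confident miss far more heavily.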

The meta-pattern is that evaluation is increasingly a research problem in its own right, and small methodological changes can flip published conclusions. The defensive move that has emerged is cross-cutting: replicate, swap the evaluation setup, swap the prompt format, swap the scoring rule, and only believe what survives all four.
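The "only believe what survives every variation" rule can be sketched as a simple gate over repeated runs. This is a minimal illustration, not a method taken from the cited episodes: the `Run` record, the condition labels, and all scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    condition: str   # hypothetical label for the evaluation variant
    score_a: float   # score of system A under this variant
    score_b: float   # score of system B under this variant

def failing_conditions(runs, min_margin=0.0):
    """Conditions under which A does not beat B by more than min_margin."""
    return [r.condition for r in runs if r.score_a - r.score_b <= min_margin]

def survives_all(runs, min_margin=0.0):
    """Accept 'A beats B' only if it holds under every variant tested."""
    return bool(runs) and not failing_conditions(runs, min_margin)

# Invented numbers for illustration only.
runs = [
    Run("original", 0.62, 0.40),
    Run("replication", 0.60, 0.41),
    Run("alt_setup", 0.58, 0.47),
    Run("alt_prompt_format", 0.55, 0.50),
    Run("alt_scoring_rule", 0.30, 0.45),
]

print("claim survives:", survives_all(runs))
print("fails under:", failing_conditions(runs))
```

Here the claim holds in four conditions but reverses under the alternative scoring rule, so the gate rejects it and names the condition that broke it.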

Benchmarks that look like real work

A CMU benchmark anchored in U.S. GDP and occupational data finds the strongest frontier model scoring 3% on a five-dollar budget and 27% uncapped on real professional software E017, with agents 'cheating' by fabricating forensic hashes or computing answers in their heads rather than reading the screen. The creation-verification pattern, which separates environment construction from independent verification, is becoming a default for benchmark engineering. Verifiable-RL pipelines for agents apply the same information-barrier discipline to training data generation E080.
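The separation between building an environment and verifying a result can be sketched as follows. This is an illustrative toy under stated assumptions, not the benchmark's actual implementation: the `Task` type, the prompt, and the artifact bytes are hypothetical. The point it shows is that the verifier recomputes the expected value from the artifact itself, so a fabricated hash in the agent's answer cannot pass.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    prompt: str       # what the agent is shown
    artifact: bytes   # the file the agent is supposed to inspect

def create_task() -> Task:
    """Creator side: builds the environment and writes no answer key."""
    return Task(prompt="Report the SHA-256 of evidence.bin",
                artifact=b"disk image bytes")

def verify(task: Task, submitted_hash: str) -> bool:
    """Verifier side: derives the expected value from the artifact alone,
    never from the agent's transcript or from anything the creator asserted."""
    expected = hashlib.sha256(task.artifact).hexdigest()
    return submitted_hash.strip().lower() == expected

task = create_task()
honest = hashlib.sha256(task.artifact).hexdigest()  # agent actually read the file
fabricated = "deadbeef" * 8                         # plausible-looking 64 hex chars

print("honest answer accepted:", verify(task, honest))
print("fabricated answer accepted:", verify(task, fabricated))
```

A format-only check (64 hexadecimal characters) would accept both answers; recomputing from the environment accepts only the honest one.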

Real workplace tasks also surface termination, clarification, and failure modes invisible to short-task benchmarks [E030, E035, E061]. The field is converging on the view that benchmark results should be interpreted as a lower bound that includes both capability and operational reliability.

Auditing, provenance, and observation channels

Auditors cannot detect non-compliance from transcripts alone, and the limitation holds in principle, not just in practice E020. The architectural fix, borrowed from aviation, surgery, and finance, is to install a second observation channel and score it independently, which is what production deployments are starting to do for security E057. Evidence-carrying agents push this further into a typed model where prose can propose but only evidence can authorise E062, and chain-of-evidence contracts apply the same idea to autonomous research papers E089.
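A second observation channel can be sketched as a comparison between what the transcript claims and what an independent log recorded. This is a minimal illustration under assumptions: the `Event` type, the `audit` function, and the example events are hypothetical and not drawn from the cited deployments.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Event:
    action: str
    target: str

def audit(claimed, observed):
    """Compare the agent's self-report with an independently recorded log.

    claimed:  events the transcript says happened (channel 1, agent-authored)
    observed: events captured outside the agent's control (channel 2)
    """
    claimed_set, observed_set = set(claimed), set(observed)
    unreported = sorted(observed_set - claimed_set)   # happened, never mentioned
    unsupported = sorted(claimed_set - observed_set)  # mentioned, never happened
    return {
        "unreported": unreported,
        "unsupported": unsupported,
        "compliant": not unreported and not unsupported,
    }

# Invented example: the transcript omits one action the log captured.
transcript_claims = [Event("read", "config.yaml")]
tool_log = [Event("read", "config.yaml"), Event("delete", "audit.log")]

report = audit(transcript_claims, tool_log)
print(report)
```

A judge reading only the transcript would see nothing wrong here; the discrepancy is visible only because the second channel is recorded and scored separately.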

The operational shift is that 'auditing' increasingly means architectural separation rather than smarter post-hoc judges. The deepest version of this argument is that even frontier LLM judges with allow 79% of unsafe actions through E062 — the ceiling on text-only judgment looks structural.

Episodes anchoring this topic