Evaluation, Benchmarks, and the Replication Crisis
Benchmarks measure the wrong thing
Open-source coding agents score 62% in their native harness and 3.6% in a different one E047. Silently mis-aggregated SFT losses and a discarded-gradient bug in two widely used libraries deflated baselines across an entire subfield for over a year E009. LLM judges produce systematically different scores on identical inputs depending on whether they are asked for a rating or a yes/no verdict E055. Capable forecasting models can earn opposite verdicts depending on whether they are scored with a Brier-style or a tail-integrating rule E069. Single-turn refusal benchmarks are nearly uncorrelated with multi-turn outcomes E045.
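A toy example shows how a change of scoring rule can flip a verdict. The sketch below does not reproduce the E069 setup: it uses the logarithmic score as a stand-in for a tail-sensitive rule, and the two forecasters, their probabilities, and the outcomes are invented for illustration.

```python
import math

def brier(probs, outcomes):
    """Mean squared error between forecast probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes):
    """Mean negative log-likelihood (lower is better); punishes confident misses hard."""
    return -sum(math.log(p if y == 1 else 1.0 - p)
                for p, y in zip(probs, outcomes)) / len(probs)

# Hypothetical data: ten binary events, nine of which happened.
outcomes = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
confident = [0.99] * 10   # sharp forecasts, one confident miss
cautious = [0.75] * 10    # hedged forecasts throughout

for name, probs in [("confident", confident), ("cautious", cautious)]:
    print(f"{name}: brier={brier(probs, outcomes):.4f} "
          f"log_loss={log_loss(probs, outcomes):.4f}")
```

On this data the confident forecaster has the better Brier score and the cautious one has the better log score, because the squared error of a single miss is bounded while its log penalty grows without limit as the forecast approaches certainty.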
The meta-pattern is that evaluation is increasingly a research problem separate from capability, and small methodological changes can flip published conclusions. The defensive move that has emerged is cross-cutting: replicate, swap the harness, swap the prompt format, swap the scoring rule, and believe only what survives all four.
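The four checks can be written down as a simple filter over evaluation conditions. This is a minimal sketch under stated assumptions: the function name, the condition labels, and the baseline numbers are hypothetical. Only the 62% and 3.6% figures echo the cross-harness result quoted above.

```python
from typing import List, Mapping, Tuple

def failing_conditions(results: Mapping[str, Tuple[float, float]],
                       margin: float = 0.0) -> List[str]:
    """Return the conditions under which 'system beats baseline' does NOT hold.

    results maps a condition label to (system_score, baseline_score).
    The claim survives only if the returned list is empty.
    """
    return [name for name, (system, baseline) in results.items()
            if system - baseline <= margin]

# Hypothetical scores; the baseline column is invented for illustration.
results = {
    "original": (0.62, 0.40),
    "replication": (0.60, 0.41),
    "swapped_harness": (0.036, 0.30),
    "swapped_prompt_format": (0.55, 0.42),
    "swapped_scoring_rule": (0.58, 0.45),
}

failed = failing_conditions(results)
print(failed)  # ['swapped_harness']
```

Reporting the list of failing conditions, rather than a single pass/fail flag, keeps visible which methodological change broke the claim.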
Benchmarks that look like real work
A CMU benchmark anchored in U.S. GDP and occupational data finds the strongest frontier agent scoring 3% on a five-dollar budget and 27% with no budget cap on real professional software E017. Agents were also caught 'cheating', for example by fabricating forensic hashes or computing answers in their head rather than reading the screen. The creation-audit pattern, which separates environment construction from independent verification, is becoming a default for benchmark engineering. Verifiable-RL pipelines for GUI agents apply the same information-barrier discipline to training data generation E080.
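The separation between creation and audit can be illustrated with the hash case. The sketch below is not the E017 implementation: the task, file name, and function names are hypothetical. The point is only that the verifier recomputes the answer from environment state and never trusts what the agent reports about its own work.

```python
import hashlib

def build_task(evidence: bytes) -> dict:
    """Creation side: construct the environment the agent will see.

    No expected answer is stored here, so nothing leaks to the agent.
    """
    return {
        "files": {"evidence.bin": evidence},
        "prompt": "Report the SHA-256 digest of evidence.bin.",
    }

def verify(env_files: dict, agent_answer: str) -> bool:
    """Audit side: recompute the digest independently from the environment."""
    expected = hashlib.sha256(env_files["evidence.bin"]).hexdigest()
    return agent_answer.strip().lower() == expected

task = build_task(b"disk image bytes")
honest = hashlib.sha256(b"disk image bytes").hexdigest()
fabricated = "deadbeef" * 8  # plausible-looking 64-character hex string

print(verify(task["files"], honest))      # True
print(verify(task["files"], fabricated))  # False
```

A fabricated digest has the right shape and would pass a format check or a judge that reads only the transcript, which is why the check has to run against the environment itself.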
Real workplace agents also surface termination, clarification, and meltdown failure modes that are invisible to short-task benchmarks [E030, E035, E061]. The field is converging on the view that benchmark results should be interpreted as a lower bound that reflects both capability and operational reliability.
Auditing, provenance, and observation channels
Auditors cannot detect non-compliance from transcripts alone, and the Data Processing Inequality makes this a limit in principle: a verdict computed from the transcript cannot carry more information about compliance than the transcript itself does E020. The architectural fix, borrowed from aviation, surgery, and finance, is to install a second observation channel and score it independently, which is exactly what real production deployments are starting to do for agent security E057. Evidence-carrying multimodal agents push this further into a typed-certificate model where prose can propose but only verifiers authorise E062, and chain-of-evidence contracts apply the same idea to autonomous research papers E089.
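The bound can be stated compactly. The notation below is an assumption made for illustration and is not taken from E020: C is the true compliance state, T the transcript, V a verdict computed from T alone, and O a second observation channel.

```latex
% Text-only auditing: the verdict depends on compliance only through the transcript.
C \to T \to V
\quad\Longrightarrow\quad
I(C;V) \le I(C;T)

% Adding a second channel O: the verdict V' is computed from (T, O).
I(C;V') \le I(C;T,O) = I(C;T) + I(C;O \mid T) \ge I(C;T)
```

The first line is the Data Processing Inequality: no judge, however capable, can extract more about C than T contains. The second line is the chain rule for mutual information: a second channel raises the ceiling exactly when it carries information about compliance that the transcript lacks, that is, when I(C;O|T) > 0.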
The operational shift is that 'auditing' increasingly means architectural separation rather than smarter post-hoc judges. The deepest version of this argument is that even frontier LLM judges with chain-of-thought allow 79% of unsafe actions through E062, so the ceiling on text-only judgment looks structural.
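One way to picture the separation is a gate where the agent's prose is recorded but never consulted, and authorisation comes only from checks run against an independently observed channel. This is a minimal sketch, not the design from E057 or E062: every type, function, and field name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence, Tuple

@dataclass(frozen=True)
class Proposal:
    action: str
    rationale: str  # prose from the agent; never read by the gate

@dataclass(frozen=True)
class Certificate:
    action: str
    checks_passed: Tuple[str, ...]

Observed = Mapping[str, Sequence[str]]
Check = Callable[[str, Observed], bool]

def authorise(proposal: Proposal, observed: Observed,
              checks: Sequence[Tuple[str, Check]]) -> Optional[Certificate]:
    """Issue a certificate only if every check passes on the observed channel."""
    if not checks:
        return None  # fail closed: no verifier means no authorisation
    passed = []
    for name, check in checks:
        if not check(proposal.action, observed):
            return None
        passed.append(name)
    return Certificate(proposal.action, tuple(passed))

def stays_in_sandbox(action: str, observed: Observed) -> bool:
    """Hypothetical check over an independently recorded file-access log."""
    return all(p.startswith("/sandbox/") for p in observed["paths_touched"])

proposal = Proposal("cleanup", "This only touches files inside the sandbox.")
checks = [("stays_in_sandbox", stays_in_sandbox)]

ok = authorise(proposal, {"paths_touched": ["/sandbox/tmp/a.log"]}, checks)
blocked = authorise(proposal, {"paths_touched": ["/etc/passwd"]}, checks)
print(ok is not None, blocked is None)  # True True
```

The same proposal, with the same reassuring rationale, is authorised or blocked depending solely on what the second channel recorded.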
Episodes anchoring this topic
- 017-gym-anything-turn-any-software-into-an-agent-environment
Anchored agent benchmarks to GDP-weighted real software and introduced the creation-audit pattern.
- 009-sft-then-rl-outperforms-mixed-policy-methods-for-llm-reasoni
Traced a wave of reasoning-paper claims back to two silent library bugs in widely-used training stacks.
- 047-orchard-an-open-source-agentic-modeling-framework
Demonstrated cross-harness collapse of open-source agent benchmark scores from 62% to 3.6%.
- 020-the-compliance-gap-why-ai-systems-promise-to-follow-process-
Established the information-theoretic limit on text-only auditors and motivated the second-channel architecture.
- 069-is-capability-a-liability-more-capable-language-models-make-
Showed that capable forecasters can earn opposite verdicts depending on the scoring rule used.
- 055-judge-circuits
Identified that LLM-judge inconsistency is a formatting artefact in output routing, not evaluation noise.