Evaluation, measurement, and reproducibility
When benchmarks measure the wrong thing
Open-source agent benchmark scores reflect harness fit more than capability: a 62%-vs-3.6% swing across wrappers on the same model E047. Capable forecasting models look *best* under Brier-style scoring and *worst* under tail-integrating CRPS — opposite verdicts on the same outputs, undetectable without changing the scoring rule E069. Political-bias audits at default prompts measure sycophancy, not ideology, with one preamble sentence swinging models 60+ points E015. And the reward model that wins RewardBench-2 by 40 points produces a *worse* policy than rubrics from a frozen 1.7B judge — evidence that the reward-quality benchmark and the actual training signal have drifted apart E019.
The meta-point is that 'audit the auditor' is now overdue. Several papers — single-turn refusal benchmarks barely correlating with multi-turn behaviour E045, LLM-as-judge inconsistency lurking in formatter circuits E055, LLM-as-judge missing 79% of unsafe actions E062 — converge on the same warning: the evaluation stack itself is the highest-leverage target for the next wave of methodological work.
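The scoring-rule flip described above for E069 can be reproduced on toy numbers. The following is a minimal sketch, not the episode's actual setup: the two forecasts, the outcome, and the threshold are invented for illustration, and the function names `brier_score` and `crps` are hypothetical helpers. It scores the same two sample-based forecasts once with a Brier score on a thresholded event and once with a sample-based CRPS, which integrates over the whole distribution including the tail.

```python
from itertools import product


def brier_score(samples, threshold, outcome):
    """Brier score for the binary event `outcome > threshold` (lower is better)."""
    p_event = sum(x > threshold for x in samples) / len(samples)
    occurred = 1.0 if outcome > threshold else 0.0
    return (p_event - occurred) ** 2


def crps(samples, outcome):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    spread_to_outcome = sum(abs(x - outcome) for x in samples) / n
    spread_within = sum(abs(a - b) for a, b in product(samples, repeat=2)) / (n * n)
    return spread_to_outcome - 0.5 * spread_within


# Invented toy data (an assumption, not taken from E069).
outcome = 10.0
threshold = 5.0
forecast_a = [9.0, 10.0, 11.0, 100.0]  # confident on the event, one wild tail sample
forecast_b = [4.0, 9.0, 10.0, 11.0]    # less confident on the event, tight tail

brier_a = brier_score(forecast_a, threshold, outcome)
brier_b = brier_score(forecast_b, threshold, outcome)
crps_a = crps(forecast_a, outcome)
crps_b = crps(forecast_b, outcome)

print("Brier:", brier_a, brier_b)  # A scores better (lower) here
print("CRPS: ", crps_a, crps_b)    # B scores better (lower) here
```

On these toy forecasts, A wins under the thresholded Brier score because every sample lands on the correct side of the threshold, while B wins under CRPS because A's single extreme sample is penalised once the whole distribution is scored. Neither result is wrong; they answer different questions, which is why the choice of rule has to be stated up front.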
Building better instruments
STALE separates memory recognition from memory authority and shows that a model can score 92% on 'is this stale?' yet only 30% on a question that quietly assumes the stale fact E031. The clarification-timing study isolates timing from noticing by forced injection, drawing separate empirical decay curves for goal, input, constraint, and context ambiguities E035. Gym-Anything builds an agent-task benchmark on U.S. GDP and occupational data and reaches 27% top-line performance uncapped; its broader methodological contribution is a creation-audit pattern in which agents make tasks and adversarial auditors check the claims E017.
This pattern matters because it generalises. The papers that have actually moved consensus this year were as often measurement contributions as method contributions.
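The creation-audit pattern attributed to E017 above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not Gym-Anything's implementation: the `TaskClaim` structure, the `audit` function, the two auditor functions, and the evidence keys are all hypothetical names chosen for the example. The only idea taken from the text is the division of labour, where one party proposes tasks with claims attached and independent auditors try to knock those claims down.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class TaskClaim:
    """A task proposed by a creator agent, with the evidence it offers (hypothetical schema)."""
    name: str
    claimed_solvable: bool
    evidence: Dict[str, object] = field(default_factory=dict)


# An auditor returns a list of objections; an empty list means no objection.
Auditor = Callable[[TaskClaim], List[str]]


def audit(claims: List[TaskClaim], auditors: List[Auditor]) -> Tuple[List[TaskClaim], Dict[str, List[str]]]:
    """Accept a task only if no auditor raises an objection against it."""
    accepted: List[TaskClaim] = []
    rejected: Dict[str, List[str]] = {}
    for claim in claims:
        objections = [msg for auditor in auditors for msg in auditor(claim)]
        if objections:
            rejected[claim.name] = objections
        else:
            accepted.append(claim)
    return accepted, rejected


def requires_reference_solution(claim: TaskClaim) -> List[str]:
    # Hypothetical check: a "solvable" claim must come with a reference solution.
    if claim.claimed_solvable and "reference_solution" not in claim.evidence:
        return ["claims solvable but provides no reference solution"]
    return []


def requires_checker_pass(claim: TaskClaim) -> List[str]:
    # Hypothetical check: an independent checker must have confirmed completion.
    if claim.evidence.get("checker_passed") is not True:
        return ["checker did not confirm completion"]
    return []


claims = [
    TaskClaim("export-report", True, {"reference_solution": "steps.txt", "checker_passed": True}),
    TaskClaim("merge-ledgers", True, {"checker_passed": False}),
]
accepted, rejected = audit(claims, [requires_reference_solution, requires_checker_pass])
print([c.name for c in accepted])
print(rejected)
```

The design choice worth noting is that auditors only return objections and never repair a task, so a creator's claim is either independently supported or dropped with its reasons recorded.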
Reproducibility and silent failure
Two silent training-library bugs deflated baselines across an entire subfield for more than a year; once fixed, corrected SFT-then-RL beats every published mixed-policy method by 3.8 points on Qwen and 22 on Llama E009. Entropy, the field's go-to RL health metric, can't see template collapse E010. And the commitment-failure analysis of hallucinations shows that up to 47% of mistakes happen when the correct answer is already in the model's distribution — a hallucination-detection problem that uncertainty-based detectors structurally cannot solve E070.
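The commitment-failure statistic above can be made concrete with a small calculation. This is a sketch under a loud assumption: it treats "the correct answer is already in the model's distribution" as "the gold answer appears among a handful of sampled answers", which may differ from how E070 operationalises it. The record format, the function name `commitment_failure_rate`, and the toy data are all hypothetical.

```python
def commitment_failure_rate(records):
    """Among wrong committed answers, the fraction whose gold answer was sampled at least once."""
    errors = [r for r in records if r["committed"] != r["gold"]]
    if not errors:
        return 0.0
    recoverable = [r for r in errors if r["gold"] in r["samples"]]
    return len(recoverable) / len(errors)


# Invented toy records (an assumption, not data from E070).
records = [
    # Correct commitment: not counted as an error.
    {"gold": "Paris", "committed": "Paris", "samples": ["Paris", "Paris", "Lyon"]},
    # Wrong commitment, but the gold answer was sampled: a commitment failure.
    {"gold": "Canberra", "committed": "Sydney", "samples": ["Sydney", "Canberra", "Sydney"]},
    # Wrong commitment and the gold answer never appears: a knowledge gap.
    {"gold": "Ottawa", "committed": "Toronto", "samples": ["Toronto", "Toronto", "Toronto"]},
]

rate = commitment_failure_rate(records)
print(rate)
```

Splitting errors this way separates cases where the model lacked the answer from cases where it had the answer and committed elsewhere, which is the distinction the paragraph above turns on.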
Taken together, these episodes argue that future progress claims will need carefully checked baselines, framework diversity as epistemic insurance, and scoring rules chosen *before* the experiment rather than after.
Episodes anchoring this topic
- 009-sft-then-rl-outperforms-mixed-policy-methods-for-llm-reasoni
The clearest reproducibility story in the corpus — two silent library bugs invalidating a subfield's baselines.
- 047-orchard-an-open-source-agentic-modeling-framework
Showed harness-fit, not capability, drives most reported agent benchmark variance.
- 069-is-capability-a-liability-more-capable-language-models-make-
Demonstrated that scoring-rule choice can flip leaderboard verdicts on identical outputs.
- 015-political-bias-audits-of-llms-capture-sycophancy-to-the-infe
Showed default-prompt audits measure sycophancy more than ideology, with eightfold response asymmetries.
- 017-gym-anything-turn-any-software-into-an-agent-environment
Introduced the creation-audit pattern for benchmark construction in domains where agents hallucinate completion.
- 070-hallucination-as-commitment-failure-larger-llms-misfire-desp
Identified commitment failure as a structural limit on uncertainty-based hallucination detection.