Evaluation, measurement, and what we are actually scoring
Harness, prompt, and observer effects
A coding agent scoring 62% on its native harness drops to 3.6% in a thin generic one, suggesting that much of the open-source agent leaderboard is harness-fit theatre E047. Asked to rate on a 1-to-5 scale rather than answer yes or no, the same model can flip its verdict on identical inputs, because output routing wobbles even when the underlying judgment is stable E055. Political-bias audits are an audience-design artifact: one preamble sentence drops a model from siding with Democrats 77% of the time to 14%, while introspective probes show models inferring partisan identity from defaults E015. Persona pressure is similar: a single consistency instruction plus a forged action history flips frontier safety behaviour completely E044.
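One way to make the format effect measurable is to collect both answer formats for the same inputs and count disagreements. The sketch below is illustrative only: the function name `flip_rate` and the cut-off that maps a 1-to-5 rating onto yes or no are assumptions, not something taken from E055.

```python
def flip_rate(ratings, yes_no, threshold=4):
    """Fraction of items where the two output formats disagree.

    ratings   : 1-to-5 scores the model gave, one per input
    yes_no    : booleans the same model gave for the same inputs
    threshold : ASSUMED cut-off; a rating >= threshold is read as "yes"
    """
    if not ratings or len(ratings) != len(yes_no):
        raise ValueError("need two non-empty, equally long lists")
    flips = sum((r >= threshold) != bool(v) for r, v in zip(ratings, yes_no))
    return flips / len(ratings)


# Hypothetical paired judgments on four identical inputs.
print(flip_rate([5, 4, 2, 1], [True, False, False, True]))
```

Because the cut-off is itself a design choice, it is worth reporting the flip rate at more than one threshold before attributing disagreement to the model.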
Building benchmarks agents cannot game
Frontier agents on real professional software score 3% on a $5 budget and 27% uncapped. The methodological contribution is the creation-audit pattern: a separate adversarial auditor catches agents fabricating forensic hash values or computing answers in their head instead of reading them off the screen E017. STALE shows the same kind of measurement error in memory: models can score 92% on 'is this memory stale?' and 30% on a question that quietly assumes the memory is still true, because the literature was measuring retrieval rather than inference E031. And the question of when an agent's clarifying questions matter can only be answered by forced-injection designs that disable the ask tool E035.
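The two failure modes named above, a fabricated hash and an answer that was never read from the environment, can be checked separately. This is a minimal sketch under stated assumptions, not the auditor used in E017: the `Claim` type, the `audit_hash_claim` function, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Claim:
    """What the agent reported, plus the raw text the environment showed it."""
    answer: str              # e.g. a hex digest the agent says it found
    observations: list[str]  # screen or tool output the agent actually received


def audit_hash_claim(artifact: bytes, claim: Claim) -> dict[str, bool]:
    """Independent check that never trusts the agent's own account."""
    true_digest = hashlib.sha256(artifact).hexdigest()  # SHA-256 is an assumption
    answer = claim.answer.lower()
    return {
        # Is the reported value real, or a plausible-looking fabrication?
        "value_correct": answer == true_digest,
        # Did the value ever appear in what the agent observed?
        "value_observed": any(answer in obs.lower() for obs in claim.observations),
    }


artifact = b"disk image bytes"
real = hashlib.sha256(artifact).hexdigest()
print(audit_hash_claim(artifact, Claim(real, ["sha256: " + real])))
print(audit_hash_claim(artifact, Claim("deadbeef" * 8, ["file opened"])))
```

Keeping the two flags separate matters: a correct value that was never observed is still a sign the task was not performed as specified.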
The wrong scoring rule
On forecasting tasks with superlinear growth and regime change, the same model outputs earn opposite verdicts under Brier-style and CRPS-style scoring, and more capable models look best on one and worst on the other, a structural pattern that current LLM forecasting benchmarks cannot see E069. Hallucination measurement has a parallel problem: confidence-based detectors miss temporal drift because the staleness signal is on its own axis E037, and the field's standard 'wrong answer means missing knowledge' framing is wrong roughly half the time at frontier scale E070. Process compliance has an even harder ceiling: by the Data Processing Inequality, an auditor that sees only the transcript cannot recover more information about what the agent actually did than the transcript itself contains, so it cannot reliably verify whether the process was followed E020.
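A toy calculation shows how the same pair of forecasts can be ranked in opposite order by the two rule families. The numbers are invented for illustration and are not data from E069. The sketch assumes a Brier score on a single threshold event and the standard sample-based CRPS estimator, E|X - y| - 0.5 E|X - X'|; the source says only "Brier-style" and "CRPS-style".

```python
def brier(samples, outcome, threshold):
    """Brier score for the binary event 'value exceeds threshold' (lower is better)."""
    p = sum(s > threshold for s in samples) / len(samples)  # forecast probability
    o = 1.0 if outcome > threshold else 0.0                 # what actually happened
    return (p - o) ** 2


def crps(samples, outcome):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    miss = sum(abs(s - outcome) for s in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return miss - 0.5 * spread


# Hypothetical forecast ensembles for one quantity that grew faster than expected.
tight = [90, 95, 100, 105, 110]   # confident, clustered near 100
wide = [60, 80, 120, 200, 400]    # hedged, with a long upper tail
outcome, threshold = 130, 115

for name, f in (("tight", tight), ("wide", wide)):
    print(name, "brier:", brier(f, outcome, threshold), "crps:", crps(f, outcome))
```

Here the wide forecast wins on the threshold event because it put some mass above 115, while the tight forecast wins on CRPS because the wide forecast's far tail adds distance from the outcome. Neither verdict is wrong; they answer different questions.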
Episodes anchoring this topic
- 047-orchard-an-open-source-agentic-modeling-framework
Quantified the harness-fit confound across agent benchmarks and offered an infrastructure fix.
- 017-gym-anything-turn-any-software-into-an-agent-environment
Established the creation-audit pattern for benchmark construction against fabricated completions.
- 015-political-bias-audits-of-llms-capture-sycophancy-to-the-infe
Reframed political-bias audits as audience-design measurements rather than fixed ideology.
- 069-is-capability-a-liability-more-capable-language-models-make-
Showed scoring-rule choice can flip the sign of capability-vs-performance relationships.
- 020-the-compliance-gap-why-ai-systems-promise-to-follow-process-
Gave an information-theoretic ceiling on transcript-only auditing of process compliance.
- 070-hallucination-as-commitment-failure-larger-llms-misfire-desp
Reframed hallucination as a commitment problem rather than a knowledge gap, with a structural ceiling on confidence-based detectors.