Literature review · 6 episode(s)

Evaluation, measurement, and what we are actually scoring

← all topics  ·  Glossary →

Harness, prompt, and observer effects

A coding scoring 62% on its native drops to 3.6% in a thin generic one, suggesting much of the open-source agent leaderboard is harness-fit theatre E047. The same model rating 1-to-5 versus answering yes-no can produce flipped verdicts on identical inputs because output routing wobbles even when the underlying judgment is stable E055. Political-bias audits are an audience-design artifact: one preamble sentence drops a model from siding with Democrats 77% of the time to 14%, while introspective show models inferring partisan identity from defaults E015. Persona pressure is similar — a single consistency instruction plus a forged action history flips frontier safety behaviour completely E044.

Building benchmarks agents cannot game

Frontier on real professional software score 3% on a $5 budget and 27% uncapped, and the methodological contribution is the creation-audit pattern: a separate adversarial auditor catches agents fabricating forensic hash values or computing answers in their head instead of reading them off the screen E017. shows the same kind of measurement error in memory: models can score 92% on 'is this memory stale?' and 30% on a question that quietly assumes it is still true, because the literature was measuring retrieval rather than inference E031. And the question of which timing of an agent's clarifying questions matters can only be answered by forced-injection designs that disable the ask tool E035.

The wrong scoring rule

On forecasting tasks with superlinear growth and regime change, the same model outputs earn opposite verdicts under Brier-style and -style scoring, and more capable models look best on one and worst on the other — a structural pattern that current LLM forecasting benchmarks can't see E069. Hallucination measurement has a parallel problem: confidence-based detectors miss temporal drift because the staleness signal is on its own axis E037, and the field's standard 'wrong answer means missing knowledge' framing is wrong roughly half the time at frontier scale E070. Process compliance has an even harder ceiling — the bounds any transcript-only auditor from reliably detecting it E020.

Episodes anchoring this topic