Literature review · 6 episode(s)

Evaluation and measurement validity


Benchmarks measure harness-fit

The most damaging evaluation finding is that leaderboards measure harness-fit. The same software-engineering agent scored 62% on its native setup and 3.6% when the wrapper was swapped, which indicts how the open-source field tracks progress E047. A brutal professional-software benchmark required an adversarial auditor to even validate task completion, because agents fabricate forensic hashes and compute answers on their own instead of reading them off the screen; the strongest agent, uncapped, scored just 27% E017.

And the resources we count are the wrong ones: on real traces, resource counts such as cost predict success worse than guessing the average, because they measure activity, not progress E097.
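One way to make "worse than guessing the average" concrete is a skill score against a base-rate predictor. The sketch below is illustrative only: the traces, the probabilities, and the function names are hypothetical, the Brier score is an assumed choice of error measure, and the base rate is computed on the same data for brevity.

```python
def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def skill_vs_base_rate(probs, outcomes):
    """Return 1 - Brier(model) / Brier(base rate).

    Positive: the predictor beats always guessing the average success rate.
    Zero: it ties. Negative: it is worse than guessing the average.
    """
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier([base_rate] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("all outcomes are identical; skill is undefined")
    return 1.0 - brier(probs, outcomes) / reference


# Hypothetical traces: 1 means the task was solved.
outcomes = [1, 0, 0, 1, 0, 1, 0, 0]

# Hypothetical predictor that reads heavy activity as a sign of success.
activity_based = [0.2, 0.9, 0.8, 0.3, 0.7, 0.4, 0.9, 0.6]

skill = skill_vs_base_rate(activity_based, outcomes)
print(f"skill vs base rate: {skill:.2f}")  # negative on this toy data
```

A negative value means the activity signal carries less usable information about success than the average success rate alone does.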

The prompt changes what you measure

An observer effect has arrived in AI evaluation. Models still lean left of center under a default prompt, but one preamble sentence ('As a conservative Republican…') swings a model from siding with Democrats 77% of the time to 14%, and introspection shows the model is guessing who's asking, not expressing a fixed ideology E015. The same fragility applies to behavior timing: a clarifying question worth almost everything at action three is worth nothing at action thirty, and no model asks at the right time, so a benchmark that ignores timing misreads the behavior entirely E035.

The lesson is that fixed-prompt benchmarks systematically understate how much behavior varies across users and contexts.
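A small sketch of what reporting across prompts could look like, instead of a single fixed-prompt number. The two rates are the 77% and 14% figures quoted above; the condition labels and the helper name are hypothetical.

```python
def prompt_spread(rates):
    """Summarize how far a measured rate moves across prompt conditions.

    `rates` maps a prompt-condition label to the fraction of responses
    showing the behavior of interest under that condition.
    """
    lowest = min(rates, key=rates.get)
    highest = max(rates, key=rates.get)
    return {
        "min": (lowest, rates[lowest]),
        "max": (highest, rates[highest]),
        "spread": rates[highest] - rates[lowest],
    }


# Rates quoted in the text; the labels are illustrative.
rates = {
    "no preamble": 0.77,
    "conservative preamble": 0.14,
}

summary = prompt_spread(rates)
print(summary["min"], summary["max"], round(summary["spread"], 2))
```

Reporting the spread alongside the headline rate makes the prompt dependence visible in the result itself.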

The metric can hide the failure

Sometimes the metric itself is the problem. Capable models earn opposite verdicts depending on the scoring rule (best under Brier-style scoring, worst under a tail-sensitive rule) because they overcommit after superlinear growth, and most existing forecasting benchmarks can't see it; the one-line fix is to report a proper scoring rule E069. The same shape recurs in training: the field's go-to RL health dial has a structural blind spot of its own E010.
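The ranking flip between scoring rules can be reproduced on toy data. This sketch assumes the logarithmic score as a stand-in for a tail-sensitive rule, which the source does not specify; both forecasters and all probabilities are hypothetical.

```python
import math


def brier(probs, outcomes):
    """Mean squared error; the penalty for one bad miss is capped at 1."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_loss(probs, outcomes):
    """Mean negative log-likelihood; a confident miss is penalized heavily."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(outcomes)


outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

# Hypothetical overcommitting forecaster: near-certain every time,
# and confidently wrong on the last question.
overconfident = [0.999, 0.001, 0.999, 0.999, 0.001,
                 0.999, 0.001, 0.999, 0.999, 0.999]

# Hypothetical cautious forecaster: always leans the right way, mildly.
cautious = [0.65, 0.35, 0.65, 0.65, 0.35,
            0.65, 0.35, 0.65, 0.65, 0.35]

for name, probs in [("overconfident", overconfident), ("cautious", cautious)]:
    print(f"{name:14s} brier={brier(probs, outcomes):.4f} "
          f"log_loss={log_loss(probs, outcomes):.4f}")
```

Lower is better for both scores. On this data the overconfident forecaster wins under Brier and loses under the log score, because a single confident miss costs at most 1 under the former and about 6.9 under the latter.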

And when models grade models, the verdict is partly a formatting artifact of the question rather than a measure of quality E055. Across these, the recurring move is to ask not just 'what does the number say' but 'what is this number incapable of seeing.'

Episodes anchoring this topic