Literature review · 6 episodes

Evaluation and measurement validity

On this page

Benchmarks measure harness-fit: Reported agent scores often reflect how well a system matches its native wrapper, not underlying capability, and the same activity-counting metrics fail to predict success.
Search agents that don't search: Unplug a top search agent and it still answers nearly half of a browsing benchmark; most are verifying memorized knowledge, not investigating.
The prompt changes what you measure: Fixed-prompt evaluation systematically misreads model behavior, because a single preamble sentence can swing a model's apparent ideology or its decision to ask a clarifying question.
The metric can hide the failure: Standard scoring rules and health dials can be blind to the most damaging failure modes, sometimes ranking the worst behavior as best.

Benchmarks measure harness-fit

The most damaging evaluation finding is that agent leaderboards measure harness-fit rather than underlying capability. The same software-engineering agent scored 62% on its native setup and 3.6% when the wrapper was swapped, indicting how the open-source field tracks progress E047. A brutal professional-software benchmark needed an adversarial auditor just to validate task completion, because agents fabricate forensic hashes and compute answers in their heads instead of reading the screen. Even the strongest agent, run uncapped, scored just 27% E017.

And the resources we count are the wrong ones: on real agent traces, the usual counts (tokens, tool calls, cost) predict success worse than always guessing the average, because they measure activity, not progress E097.
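One way to read "worse than guessing the average" is as a negative skill score: the predictor's squared error is larger than that of a constant forecast equal to the mean success rate. The sketch below assumes that reading. The trace data and the naive "more tool calls, more success" predictor are invented for illustration and are not taken from E097.

```python
def r_squared(outcomes, predictions):
    """Skill relative to always predicting the mean outcome.

    0 means no better than the mean baseline; below 0 means worse.
    """
    mean_outcome = sum(outcomes) / len(outcomes)
    ss_res = sum((y - p) ** 2 for y, p in zip(outcomes, predictions))
    ss_tot = sum((y - mean_outcome) ** 2 for y in outcomes)
    return 1.0 - ss_res / ss_tot


# Hypothetical traces: 1 = task solved, 0 = task failed.
outcomes = [1, 0, 1, 0, 1, 0]
# Hypothetical tool-call counts; here the failing runs are the busiest ones.
tool_calls = [5, 40, 8, 35, 12, 50]

# Naive activity-based predictor: more tool calls means higher predicted success.
activity_pred = [c / max(tool_calls) for c in tool_calls]
# Baseline: always guess the average success rate.
mean_pred = [sum(outcomes) / len(outcomes)] * len(outcomes)

activity_skill = r_squared(outcomes, activity_pred)
baseline_skill = r_squared(outcomes, mean_pred)

print(f"activity predictor R^2: {activity_skill:.2f}")
print(f"mean baseline R^2:      {baseline_skill:.2f}")
```

A negative value for the activity predictor is the signature to look for: the count carries information about how much the agent did, but using it as a success signal does more harm than ignoring it.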

Search agents that don't search

A particularly clean demolition: disconnect a frontier search agent from the internet and it still answers 44% of a benchmark designed to require browsing. Given a search tool that can't find the answer, it drops below its no-tools baseline, because hard negatives pull it off course E092. Over half of its queries are seeded by entities it invented in its own reasoning rather than extracted from documents.

The deployment risk is structural: these agents are most reliable exactly when you don't need them, and collapse silently when you do. The benchmark was measuring confirmation of prior knowledge, not investigation.
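The offline ablation described above can be summarized per question. This is a minimal sketch, not the procedure from E092: the function name `ablation_summary`, the "verification share" statistic, and the ten correctness flags are all hypothetical and chosen only to show the bookkeeping.

```python
def ablation_summary(offline_correct, online_correct):
    """Compare per-question correctness with search disabled vs. enabled."""
    n = len(offline_correct)
    online_hits = sum(online_correct)
    # Questions answered correctly in both conditions needed no search at all.
    both = sum(1 for off, on in zip(offline_correct, online_correct) if off and on)
    return {
        "offline_accuracy": sum(offline_correct) / n,
        "online_accuracy": online_hits / n,
        # Share of tool-enabled successes the model could already answer offline.
        "verification_share": both / online_hits if online_hits else 0.0,
    }


# Hypothetical results on ten questions (True = answered correctly).
offline = [True, True, False, True, False, False, True, False, False, False]
online = [True, True, True, True, False, True, True, False, False, True]

summary = ablation_summary(offline, online)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A high verification share means the headline accuracy is mostly recall with a search step attached, which is the pattern the ablation is designed to expose.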

The prompt changes what you measure

An observer effect has arrived in AI evaluation. Models still audit left of center under a default prompt, but one preamble sentence ('As a conservative Republican…') swings a model from siding with Democrats 77% of the time to 14%. An introspective probe shows the model is doing audience design, guessing who is asking, rather than expressing a fixed ideology E015. The same fragility applies to the timing of behavior: a clarifying question worth almost everything at action three is worth nothing at action thirty, and no frontier model asks at the right time, so a benchmark that ignores timing misreads the capability entirely E035.

The lesson is that fixed-prompt benchmarks systematically understate how much behavior varies across users and contexts.
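A practical consequence is to report the spread across prompt variants instead of a single fixed-prompt number. The sketch below uses the two rates quoted above and assumes the 77% figure is the no-preamble condition; the condition labels and the `prompt_spread` helper are hypothetical.

```python
def prompt_spread(rate_by_prompt):
    """Summarize how far a measured rate moves across prompt variants."""
    rates = list(rate_by_prompt.values())
    return {
        "min": min(rates),
        "max": max(rates),
        "spread": max(rates) - min(rates),
    }


# Rate of siding with Democrats under each prompt condition.
# Labels are illustrative; the two values are the ones quoted in the text.
rate_by_prompt = {
    "default": 0.77,
    "conservative_preamble": 0.14,
}

report = prompt_spread(rate_by_prompt)
print(f"range: {report['min']:.2f} to {report['max']:.2f}")
print(f"spread: {report['spread']:.2f}")
```

Reporting the range alongside the default-prompt value makes it visible when a single sentence of context moves the measurement by more than the effect being audited.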

The metric can hide the failure

Sometimes the metric itself is the problem. Capable models earn opposite verdicts depending on the scoring rule (best under Brier-style scoring, worst under a tail-sensitive rule) because they overcommit after superlinear growth, and most existing forecasting benchmarks can't see it; the one-line fix is to report a tail-sensitive proper scoring rule as well E069. The same shape recurs in training: entropy, the field's go-to RL health dial, is structurally blind to template collapse E010.
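A toy example shows how two scoring rules can rank the same pair of forecasters in opposite order. It is a sketch under stated assumptions: the logarithmic score stands in for the "tail-sensitive rule" (the rule used in E069 is not specified here), and the forecasters and outcomes are invented.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error of probability forecasts (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_score(probs, outcomes):
    """Mean negative log-likelihood (lower is better); punishes confident misses."""
    total = sum(math.log(p if y == 1 else 1.0 - p) for p, y in zip(probs, outcomes))
    return -total / len(outcomes)


# Hypothetical binary events: nine occur, one does not.
outcomes = [1] * 9 + [0]

# An overcommitting forecaster is nearly certain every time.
overcommitted = [0.99] * 10
# A hedged forecaster leaves room for the surprise.
hedged = [0.75] * 10

for name, probs in [("overcommitted", overcommitted), ("hedged", hedged)]:
    print(
        f"{name:>13}: Brier = {brier_score(probs, outcomes):.4f}, "
        f"log = {log_score(probs, outcomes):.4f}"
    )
```

The overcommitted forecaster gets the better Brier score, because the squared penalty for its single miss is bounded, while the log score ranks it worse, because the penalty for a confident miss grows without bound. A leaderboard that reports only the first number never sees the second.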

And when models grade models, the verdict is partly a formatting artifact of the question rather than a measure of quality E055. Across these cases, the recurring move is to ask not just 'what does the number say' but 'what is this number incapable of seeing.'

Episodes anchoring this topic

047-orchard-an-open-source-agentic-modeling-framework
Showed agent benchmark scores largely measure harness-fit via a cross-harness collapse.
017-gym-anything-turn-any-software-into-an-agent-environment
Built a hard professional-software benchmark needing an adversarial auditor against fabricated completion.
092-livebrowsecomp-are-search-agents-searching-or-just-verifying
Showed search agents verify memorized knowledge rather than investigate, scoring high offline.
015-political-bias-audits-of-llms-capture-sycophancy-to-the-infe
Reframed political-bias audits as audience design, an observer effect in evaluation.
069-is-capability-a-liability-more-capable-language-models-make-
Showed scoring-rule choice can rank the most overcommitting models as best or worst.
035-ask-early-ask-late-ask-right-when-does-clarification-timing-
Showed clarification value is timing-dependent and no frontier model hits the window.