pass-cubed · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A strict way of grading AI agents where you only count a task as solved if the agent gets it right on all three of three independent attempts.

As stated in the literature

The pass^3 (also written pass-cubed) reliability metric on tau-bench and tau2-bench, counting a task as solved only when the agent succeeds on all three of three independent runs; the production-relevant bar for stochastic agent systems.

Also called: pass^3. Not to be confused with pass@3 (pass-at-three), which counts a task as solved if any one of three runs succeeds.

Why it matters: Production systems can't afford an agent that fails one run in three on the same task, so pass^3 measures the run-to-run reliability that actually matters for deployment.

For example, a customer-service agent that solves a refund task once but fails it on two reruns scores zero on that task under pass^3.
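
The all-three-runs rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmarks' own scoring code: it assumes exactly three recorded runs per task, and the task names, the `results` data, and the helper names `pass_cubed` and `any_of_three` are hypothetical. The lenient any-of-three score is included only for contrast.

```python
from typing import Dict, List


def _check(results: Dict[str, List[bool]]) -> None:
    """Validate the assumed input shape: at least one task, three runs each."""
    if not results:
        raise ValueError("no tasks recorded")
    for task, runs in results.items():
        if len(runs) != 3:
            raise ValueError(f"task {task!r} has {len(runs)} runs, expected 3")


def pass_cubed(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where all three runs succeeded (the strict rule)."""
    _check(results)
    solved = sum(all(runs) for runs in results.values())
    return solved / len(results)


def any_of_three(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where at least one run succeeded (lenient contrast)."""
    _check(results)
    solved = sum(any(runs) for runs in results.values())
    return solved / len(results)


# Hypothetical outcomes: True means the run succeeded.
results = {
    "refund": [True, False, False],        # solved once, failed two reruns
    "exchange": [True, True, True],        # the only task solved under the strict rule
    "cancel": [False, False, False],
    "address_change": [True, True, False],
}

print(pass_cubed(results))    # 0.25
print(any_of_three(results))  # 0.75
```

On the same hypothetical data, the strict score is a third of the lenient one, because the refund and address-change tasks each succeeded at least once but not every time. If runs were truly independent with a per-run success rate p on a task, the chance of clearing that task under the strict rule would be p cubed, so even p = 0.9 gives only about 0.73.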

Heard on the show

“Some of the benchmarks here — tau-bench and tau2-bench in particular — report scores under what's called pass-cubed, where a task only counts as solved if the agent succeeds on three independent runs.”

Episode 071 — When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface

Mentioned in 2 episodes

Related terms

agent, stochastic environment, tau-bench, tau2-bench