Definition
A strict way of grading AI agents where you only count a task as solved if the agent gets it right three times in a row.
The pass@3 (also written pass-cubed) reliability metric on tau-bench and tau2-bench, counting a task as solved only when the agent succeeds across three independent runs; the production-relevant bar for stochastic agent systems.
Also called: pass^3, pass-at-three