tau-bench · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A benchmark that tests AI agents on realistic, multi-turn customer-service conversations in which they must call tools and follow company policy.

As stated in the literature

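A multi-turn agentic benchmark covering the Retail and Airline domains, evaluated with the pass^k reliability metric (the probability that all k independent trials of a task succeed, unlike pass@k, where a single success suffices); distinct from tau2-bench, which extends it with a telecom domain and a dual-control setting in which the user can also act on the environment.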
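
The difference between the two reliability metrics is easiest to see side by side. Below is a minimal sketch, assuming each task is run n times and c of those runs succeed. It uses the standard combinatorial estimators: pass^k is estimated as C(c, k) / C(n, k) and pass@k as 1 - C(n - c, k) / C(n, k), each averaged over tasks. The function names and the sample results are illustrative, not taken from the benchmark's codebase.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that ALL k trials drawn from n succeed (c successes)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that AT LEAST ONE of k trials drawn from n succeeds."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_over_tasks(metric, results, k: int) -> float:
    """Average a per-task metric over (n_trials, n_successes) pairs."""
    return sum(metric(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task results: (trials, successes).
results = [(4, 4), (4, 2), (4, 0)]

for k in (1, 2, 4):
    print(
        k,
        round(mean_over_tasks(pass_hat_k, results, k), 3),
        round(mean_over_tasks(pass_at_k, results, k), 3),
    )
```

On this made-up data the two metrics agree at k = 1 and then diverge: pass^k falls as k grows because the inconsistent task stops counting, while pass@k rises because one lucky run is enough.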

Also called: τ-bench, tau bench

Why it matters: It exposes how reliably agents handle real customer-service workflows, where one wrong step can violate policy or anger a user.

For example, an agent must handle a multi-turn airline-rebooking conversation, looking up the customer's reservation with tool calls and applying the right fare rules.

Heard on the show

“The second benchmark is τ-bench retail — multi-step tool use, the agent has to make correct tool calls and produce complete responses.”

Episode 181 — How to Backpropagate Blame Through a Team of Chatbots — And When It Backfires

Mentioned in 3 episodes

Related terms

agent, multi-turn, pass at k, tau2-bench