Glossary · Term

Terminal-Bench

Definition

Plain language

A benchmark of hard command-line tasks for agentic systems.

As stated in the literature

A benchmark of command-line tasks for AI agents, covering operations such as file recovery, system administration, and shell-driven problem solving.

Also called: Terminal-Bench v2

Why it matters: It tests whether agents can really operate a computer the way a sysadmin does, not just write code in a sandbox.

For example, an agent might be dropped into a broken Linux system and asked to recover deleted files using only shell commands.

“So every result on the other benchmark, Terminal-Bench, which is hard command-line tasks, is an out-of-domain test.”

agent