Definition
A benchmark of hard command-line tasks for agentic systems.
A command-line task benchmark for AI agents covering operations like file recovery, system administration, and shell-driven problem solving.
Also called: Terminal-Bench v-two
Mentioned in 2 episodes
046
003