Concept · 1 episode

GAIA Benchmark


Definition

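GAIA (General AI Assistants) is a benchmark of real-world questions that require web browsing, file handling, and multi-step reasoning. Each question has a short, unambiguous answer (a string, a number, or a list), and a response is scored only on whether its final answer matches the ground truth. Human respondents score about 92%, while even strong agent stacks have historically scored well below that; GPT-4 with plugins reached 15% in the original paper. That gap makes GAIA a useful frontier metric.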
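Because GAIA grades only the final answer, scoring reduces to comparing a normalized prediction against a normalized ground truth. The sketch below is a simplified illustration of that idea, not the official GAIA scorer. The function names (`normalize_text`, `to_number`, `score_answer`, `accuracy`) and the normalization rules are assumptions: numbers are compared after stripping `$`, `%`, and thousands separators; lists are split on commas or semicolons and compared item by item; and strings are compared after lowercasing and removing punctuation and whitespace.

```python
import re
import string


def normalize_text(s: str) -> str:
    # Assumed rule: lowercase, drop ASCII punctuation, remove all whitespace.
    s = s.lower()
    s = s.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", s)


def to_number(s: str):
    # Assumed rule: strip currency, percent, and thousands separators, then parse.
    cleaned = s.replace(",", "").replace("$", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def split_items(s: str) -> list:
    # Assumed rule: list answers are separated by commas or semicolons.
    return [part.strip() for part in re.split(r"[,;]", s)]


def score_answer(prediction: str, ground_truth: str) -> bool:
    # Numeric ground truth: compare parsed values.
    gt_num = to_number(ground_truth)
    if gt_num is not None:
        pred_num = to_number(prediction)
        return pred_num is not None and pred_num == gt_num
    # List ground truth: same length, and every element must match in order.
    if "," in ground_truth or ";" in ground_truth:
        gt_items = split_items(ground_truth)
        pred_items = split_items(prediction)
        if len(gt_items) != len(pred_items):
            return False
        return all(score_answer(p, g) for p, g in zip(pred_items, gt_items))
    # Plain string ground truth: compare normalized text.
    return normalize_text(prediction) == normalize_text(ground_truth)


def accuracy(predictions: list, ground_truths: list) -> float:
    # Fraction of questions whose final answer is judged correct.
    correct = sum(score_answer(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


if __name__ == "__main__":
    print(score_answer("$1,000", "1000"))  # numeric match after stripping symbols
    print(score_answer("Paris.", "paris"))  # string match after normalization
    print(accuracy(["Paris", "42"], ["paris", "41"]))
```

Note that only the final answer matters here: an agent's browsing steps and intermediate reasoning play no part in the score, which is what makes the metric cheap to compute but strict about answer formatting.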

Episodes covering this