BBH · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A collection of hard reasoning tasks pulled from BIG-Bench to stress-test language models.

As stated in the literature

BIG-Bench Hard: a subset of 23 BIG-Bench tasks on which earlier language models failed to outperform the average human rater; widely used as a multi-task reasoning benchmark.

Why it matters: It became a standard benchmark for showing that model scale and chain-of-thought prompting improve performance on hard, varied reasoning tasks.

For example, BBH includes tasks such as tracking objects as they are swapped between people and evaluating nested logical statements, both of which even strong models historically struggled with.
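To give a feel for what an object-tracking task asks of a model, here is a minimal sketch in Python. The names, objects, and swap sequence are a made-up toy instance, not an actual BBH item, and `track_swaps` is a hypothetical helper written only for this illustration. The model sees the setup in words and has to arrive at the same final state that the code computes mechanically.

```python
def track_swaps(initial, swaps):
    """Return who holds what after applying each pairwise swap in order."""
    holdings = dict(initial)  # copy so the starting state is left untouched
    for a, b in swaps:
        # The two named people trade whatever they are currently holding.
        holdings[a], holdings[b] = holdings[b], holdings[a]
    return holdings


# Hypothetical toy instance in the style of an object-tracking task.
initial = {"Alice": "red ball", "Bob": "blue ball", "Claire": "green ball"}
swaps = [("Alice", "Bob"), ("Bob", "Claire"), ("Alice", "Claire")]

final = track_swaps(initial, swaps)
for person, item in final.items():
    print(f"{person} ends up with the {item}")
```

The bookkeeping is trivial for a program. The difficulty for a language model is carrying the intermediate state through every swap when the whole problem is phrased as prose.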

Heard on the show

“They train each one from scratch, on a navigation task from a benchmark called BBH.”

Episode 060 — When Splitting One Model Across Three Agents Doubles Its Accuracy

Mentioned in 1 episode

060
When Splitting One Model Across Three Agents Doubles Its Accuracy

Related terms

BIG-Bench Hard