Definition
A collection of challenging reasoning tasks drawn from BIG-Bench, used to stress-test language models.
BIG-Bench Hard (BBH): the subset of 23 BIG-Bench tasks on which language models evaluated at the time of its release did not outperform the average human rater; widely used as a multi-task reasoning benchmark.