Definition
A very hard benchmark of expert-level questions across many fields, used to test the limits of top AI systems.
A frontier evaluation suite of expert-level multi-domain questions used to probe ceiling performance of capable models.
Also called: HLE