HumanEval

Definition

Plain language

A small set of programming problems used as a standard test for code-generating models.

As stated in the literature

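OpenAI's benchmark of 164 hand-written Python programming problems, each with a function signature, a docstring, and unit tests. It is used to measure functional correctness, meaning whether generated code actually passes the tests, and results are typically reported as pass@k.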

Why it matters: It became one of the default yardsticks for comparing code-generation models, shaping which systems get called 'good at programming.'

For example, a model might be given the signature and docstring of a Python function that returns the longest common prefix of a list of strings, then graded on whether its completed code passes unit tests that are not shown in the prompt.
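The grading step in that example can be sketched roughly as follows. This is an illustrative sketch, not the benchmark's actual harness: the candidate solution, the test cases, and the helper name passes_all_tests are all assumptions made up for this entry, and a real evaluation would run untrusted model output inside a sandbox with a timeout rather than calling exec directly.

```python
# Hypothetical candidate solution, standing in for model-generated code.
CANDIDATE_SOURCE = '''
def longest_common_prefix(strings):
    """Return the longest prefix shared by every string in the list."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for i, ch in enumerate(shortest):
        if any(s[i] != ch for s in strings):
            return shortest[:i]
    return shortest
'''

# Hypothetical unit tests as (input, expected output) pairs.
# These are invented for illustration and are not taken from the benchmark.
TEST_CASES = [
    (["flower", "flow", "flight"], "fl"),
    (["dog", "racecar", "car"], ""),
    (["same", "same"], "same"),
    ([], ""),
]


def passes_all_tests(source, entry_point, cases):
    """Run the candidate source and check it against every test case.

    Any exception (syntax error, missing function, crash) counts as a failure.
    """
    namespace = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code; sketch only
        func = namespace[entry_point]
        return all(func(arg) == expected for arg, expected in cases)
    except Exception:
        return False


if __name__ == "__main__":
    print(passes_all_tests(CANDIDATE_SOURCE, "longest_common_prefix", TEST_CASES))
```

A problem counts as solved only if every test passes, so a candidate that handles most inputs but crashes on the empty list would still be marked as a failure.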

Heard on the show

“On HumanEval, the coding benchmark, single-agent gets seventeen percent.”

Episode 060 — When Splitting One Model Across Three Agents Doubles Its Accuracy

Mentioned in 2 episodes

060: When Splitting One Model Across Three Agents Doubles Its Accuracy
013: Why Search Keeps Rediscovering the Same Workflow, and What That Means

Related terms

Python unit test