Glossary · Term

HumanEval

← all terms

Definition

A small set of programming problems used as a standard test for code-generating models.

OpenAI's hand-written benchmark of Python programming problems with unit tests, used to measure functional code generation accuracy.

Mentioned in 2 episodes

  1. 060
    When Splitting One Model Across Three Agents Doubles Its Accuracy
  2. 013
    Why Search Keeps Rediscovering the Same Workflow, and What That Means