GSM8K

Definition

Plain language

A standard benchmark of grade-school math word problems used to test reasoning.

As stated in the literature

A dataset of roughly 8,500 grade-school word problems requiring arithmetic and multi-step reasoning, widely used to evaluate the math capabilities of language models.

Why it matters: It's been the workhorse benchmark for tracking progress on basic multi-step reasoning, even as frontier models have started to saturate it.

For example, a typical GSM8K problem asks how many cookies are left after sharing — a few sentences of setup followed by a two- or three-step calculation.
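To make the shape of such a problem concrete, here is a minimal sketch of how an answer might be checked. The cookie problem is invented for illustration and is not taken from the dataset. The `#### <number>` final-answer line follows the convention used in GSM8K reference solutions. The helper names (`final_answer`, `is_correct`) are hypothetical, and the sketch assumes the model has been prompted to end its output with the same marker.

```python
import re
from typing import Optional

# Hypothetical GSM8K-style problem (invented for illustration, not from the dataset).
QUESTION = (
    "Maya bakes 24 cookies. She gives 3 cookies to each of her 5 friends "
    "and then eats 2 herself. How many cookies are left?"
)

# Reference solution: worked steps, then the final answer after '####'.
REFERENCE = (
    "Maya gives away 3 * 5 = 15 cookies.\n"
    "After sharing she has 24 - 15 = 9 cookies.\n"
    "After eating 2 she has 9 - 2 = 7 cookies.\n"
    "#### 7"
)

# A number (optional sign, thousands separators, optional decimals) after '####'.
_FINAL = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def final_answer(text: str) -> Optional[str]:
    """Return the last '####' answer in the text with commas removed, or None."""
    matches = _FINAL.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(model_output: str, reference: str) -> bool:
    """Exact string match on the final answer (assumes the model emits '####')."""
    predicted = final_answer(model_output)
    return predicted is not None and predicted == final_answer(reference)


if __name__ == "__main__":
    print(is_correct("3 * 5 = 15, 24 - 15 - 2 = 7\n#### 7", REFERENCE))
```

Exact string matching is a simplification: it would treat `7` and `7.0` as different answers, so a real evaluation harness would likely normalize numbers further.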

Heard on the show

“On GSM8K, the gain is essentially zero.”

Episode 079 — An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models

Mentioned in 4 episodes

079 · An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
040 · Two Frozen Models Learn to Whisper: Coupling Through Hidden States
026 · What RL Actually Does to Language Models, at the Token Level
013 · Why Search Keeps Rediscovering the Same Workflow, and What That Means

Related concepts

Math Benchmarks