Definition
Math benchmarks measure model performance on mathematical reasoning, from grade-school word problems (GSM8K) through competition-level problems (MATH) to research-level questions (FrontierMath). They’ve been one of the most active arenas of capability progress and a recurring case study in benchmark saturation.
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.