Definition
A test set focusing on whether models can give correct final numerical answers to math problems.
An evaluation suite emphasizing final-answer correctness on math problems, used alongside IMO ProofBench and other proof-quality benchmarks.
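As an illustration of what "final-answer correctness" can mean in practice, here is a minimal sketch of a final-answer scorer. It is not the benchmark's actual grader: the function names `parse_number` and `final_answer_accuracy` are hypothetical, and the sketch assumes answers arrive as plain strings holding an integer, decimal, or simple fraction.

```python
from fractions import Fraction


def parse_number(text):
    """Parse '42', '-3.5', '1,000' or '7/2' into an exact Fraction.

    Returns None when the text is not a plain number (assumed format).
    """
    try:
        return Fraction(text.strip().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None


def final_answer_accuracy(predictions, references):
    """Fraction of problems whose predicted final answer equals the reference.

    Only the final value is compared; the reasoning or proof that led to it
    is ignored, which is the contrast with proof-quality benchmarks.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = 0
    for predicted, reference in zip(predictions, references):
        p, r = parse_number(predicted), parse_number(reference)
        # Unparseable answers count as wrong rather than raising.
        if p is not None and r is not None and p == r:
            correct += 1
    return correct / len(references)
```

Comparing exact `Fraction` values treats "0.5" and "1/2" as the same answer, which plain string matching would miss. A real grader would likely also need to handle units, symbolic expressions, and answer extraction from free-form model output.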