Definition
A benchmark that grades the quality of an AI's mathematical proofs, not just whether the final answer is right.
An evaluation suite scoring full proof correctness and rigor on olympiad-style problems.