Definition
A small benchmark dataset of multi-step arithmetic word problems used to test math reasoning, commonly as an out-of-distribution check on math-reasoning agent workflows.