Definition
A benchmark of math word problems in which the number of required reasoning steps can be dialed up or down.
A procedurally generated grade-school math benchmark with controllable arithmetic depth, used to test how the reasoning quality of long-context and hybrid models scales with the amount of sequential computation a problem requires.
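To make "controllable arithmetic depth" concrete, here is a minimal sketch of how such a generator might work. It is an illustration under stated assumptions, not the benchmark's actual generator: the function name `generate_problem`, the marble story template, and the choice of operations are all hypothetical. Each unit of depth adds exactly one arithmetic step that depends on the previous result, so the answer cannot be reached without carrying out every step in order.

```python
import random


def generate_problem(depth, seed=0):
    """Return (problem_text, answer) needing exactly `depth` chained steps.

    Hypothetical sketch: the template and operations are illustrative
    assumptions, not taken from any real benchmark.
    """
    rng = random.Random(seed)  # a seeded generator keeps problems reproducible
    total = rng.randint(2, 9)
    sentences = [f"Sam starts with {total} marbles."]

    for _ in range(depth):
        op = rng.choice(["add", "sub", "mul"])
        if op == "sub" and total < 2:
            op = "add"  # nothing can be given away without reaching zero

        if op == "add":
            n = rng.randint(1, 9)
            total += n
            sentences.append(f"Then Sam buys {n} more.")
        elif op == "sub":
            n = rng.randint(1, total - 1)  # leaves at least one marble
            total -= n
            sentences.append(f"Then Sam gives away {n}.")
        else:
            n = rng.randint(2, 3)
            total *= n
            sentences.append(f"Then Sam's collection grows to {n} times its size.")

    sentences.append("How many marbles does Sam have now?")
    return " ".join(sentences), total


if __name__ == "__main__":
    # Same template, increasing depth: only the number of steps changes.
    for d in (1, 4, 8):
        text, answer = generate_problem(d, seed=42)
        print(f"depth={d}: {text} (answer: {answer})")
```

Because the depth is a parameter rather than a property of hand-written problems, accuracy can be plotted against depth while the vocabulary and problem style stay fixed.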