MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan

TL;DR
This paper introduces MathGAP, a framework for evaluating large language models on arithmetic word problems whose proofs can be made arbitrarily complex. It shows that accuracy degrades as proofs become deeper and wider, and that models are sensitive to proof structure and sentence order.
Contribution
MathGAP provides a novel data-generation method for systematic evaluation of LLMs on complex proofs, enabling analysis of reasoning generalization and model limitations.
Findings
LLMs' performance drops with deeper and wider proofs
Complex, nonlinear proof structures are especially challenging
Models are sensitive to sentence order changes
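The notions of depth, width, and nonlinearity in the findings above can be made concrete with a small proof-tree sketch. The class `ProofNode` and the functions `depth`, `width`, and `is_linear` are hypothetical names introduced here for illustration, and the definitions are assumptions that may differ from the paper's exact ones.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProofNode:
    """A statement plus the premises it was derived from (empty for a leaf)."""
    statement: str
    premises: List["ProofNode"] = field(default_factory=list)


def depth(node: ProofNode) -> int:
    """Number of inference steps on the longest path from the root to a leaf."""
    if not node.premises:
        return 0
    return 1 + max(depth(p) for p in node.premises)


def width(node: ProofNode) -> int:
    """Number of leaf statements (assumed here to be the measure of width)."""
    if not node.premises:
        return 1
    return sum(width(p) for p in node.premises)


def is_linear(node: ProofNode) -> bool:
    """True if every step uses at most one premise that was itself derived."""
    derived = [p for p in node.premises if p.premises]
    return len(derived) <= 1 and all(is_linear(p) for p in derived)


# Linear proof: each step combines one derived fact with one new leaf.
a = ProofNode("Alice has 3 apples.")
b = ProofNode("Bob has 2 more apples than Alice.")
bob = ProofNode("Bob has 5 apples.", [a, b])
c = ProofNode("Carol has 4 more apples than Bob.")
carol = ProofNode("Carol has 9 apples.", [bob, c])

# Nonlinear proof: the final step combines two derived facts.
d = ProofNode("Dave has 1 apple.")
e = ProofNode("Erin has 6 more apples than Dave.")
erin = ProofNode("Erin has 7 apples.", [d, e])
total = ProofNode("Bob and Erin have 12 apples together.", [bob, erin])
```

In this sketch the linear tree rooted at `carol` has depth 2 and width 3, while the tree rooted at `total` has the same depth but width 4 and is not linear.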
Abstract
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and…
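As a rough illustration of generating a problem statement and a reasoning trace from a proof specification, the sketch below builds a linear comparison problem of a chosen depth and optionally reorders its sentences. It is not MathGAP's generator: `generate_linear_problem`, `reorder_sentences`, the name list, and the sentence templates are all hypothetical and chosen only to show the idea.

```python
import random

# Hypothetical agent names; the chain length is limited by this list.
NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]


def generate_linear_problem(depth, seed=0):
    """Build a chain of `depth` comparison steps with a matching trace."""
    if not 1 <= depth < len(NAMES):
        raise ValueError("depth must be between 1 and len(NAMES) - 1")
    rng = random.Random(seed)  # seeded so that generation is reproducible
    total = rng.randint(1, 10)
    sentences = [f"{NAMES[0]} has {total} apples."]
    trace = []
    for i in range(1, depth + 1):
        diff = rng.randint(1, 10)
        prev, cur = NAMES[i - 1], NAMES[i]
        sentences.append(f"{cur} has {diff} more apples than {prev}.")
        trace.append(f"{cur} has {total} + {diff} = {total + diff} apples.")
        total += diff
    question = f"How many apples does {NAMES[depth]} have?"
    return sentences, question, trace, total


def reorder_sentences(sentences, seed=0):
    """Return a shuffled copy; the content and the answer are unchanged."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


sentences, question, trace, answer = generate_linear_problem(depth=3, seed=1)
print(" ".join(reorder_sentences(sentences, seed=1)), question)
```

Because the depth is a parameter, a model can be shown shallow examples in context and then tested on deeper ones, and comparing accuracy on the original and reordered statements gives a simple probe of order sensitivity.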
Taxonomy
Topics: Numerical Methods and Algorithms
Methods: Sparse Evolutionary Training
