MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

Andreas Opedal; Haruki Shirakami; Bernhard Schölkopf; Abulhair Saparov; Mrinmaya Sachan

arXiv:2410.13502 · cs.LG · February 17, 2025

TL;DR

This paper introduces MathGAP, a framework for evaluating large language models on arithmetic word problems with arbitrarily complex proof structures. It shows that model accuracy degrades as proofs get deeper and wider, and that models are sensitive to sentence order.

Contribution

MathGAP is a data-generation framework that produces problem statements and chain-of-thought reasoning traces from specifications of their arithmetic proof structure, enabling systematic study of easy-to-hard generalization with respect to proof-tree complexity.

Findings

01 LLM performance drops as proofs get deeper and wider

02 Complex, nonlinear proof structures are especially challenging

03 Models are sensitive to changes in sentence order
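
To make the notions of proof depth and sentence order concrete, here is a minimal Python sketch of a toy generator. It is not MathGAP's implementation: the function names `make_linear_problem` and `reorder`, the sentence templates, and the number ranges are all assumptions chosen for illustration. It only covers the simplest case, a linear proof in which each sentence adds one inference step.

```python
import random


def make_linear_problem(depth, seed=0):
    """Toy generator (hypothetical, not MathGAP's code).

    Builds a word problem whose proof is a chain of `depth`
    comparison steps, and returns (sentences, question, answer).
    """
    rng = random.Random(seed)
    names = [f"Person{i}" for i in range(depth + 1)]

    # Axiom: the first person's quantity is stated directly.
    total = rng.randint(1, 10)
    sentences = [f"{names[0]} has {total} apples."]

    # Each further sentence adds one inference step to the proof.
    for i in range(1, depth + 1):
        diff = rng.randint(1, 5)
        sentences.append(
            f"{names[i]} has {diff} more apples than {names[i - 1]}."
        )
        total += diff

    question = f"How many apples does {names[depth]} have?"
    return sentences, question, total


def reorder(sentences, seed=0):
    """Return a shuffled copy of the premises; the answer is unchanged."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    sentences, question, answer = make_linear_problem(depth=3)
    print(" ".join(sentences), question)
    print(" ".join(reorder(sentences)), question)
    print("answer:", answer)
```

Raising `depth` lengthens the chain of inferences a model must follow, while `reorder` changes only how the premises are presented, not the answer. Wider or nonlinear proof trees, as studied in the paper, would need a richer generator than this sketch.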

Abstract

Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and…

Taxonomy

Topics: Numerical Methods and Algorithms

Methods: Sparse Evolutionary Training