Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, Kaizhu Huang

TL;DR
This paper introduces MathCheck, a comprehensive checklist and evaluation tool for assessing the genuine mathematical reasoning and robustness of large language models across diverse tasks, revealing significant performance gaps.
Contribution
The paper presents MathCheck, a novel, versatile checklist and automatic generation tool for evaluating mathematical reasoning and robustness in language models, improving upon traditional benchmarks.
Findings
Frontier models like GPT-4o perform well on MathCheck.
Many models show a significant decline in reasoning robustness.
MathCheck better reflects true mathematical abilities than traditional benchmarks.
Abstract
Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and even reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks predominantly concentrate on problem-solving capabilities, which presents a substantial risk of model overfitting and fails to accurately measure genuine mathematical reasoning abilities. In this paper, we argue that if a model really understands a problem, it should be robustly applied across a diverse array of tasks. To this end, we introduce MathCheck, a well-designed checklist for testing task generalization and reasoning robustness, as well as an automatic tool to generate checklists efficiently. MathCheck includes multiple mathematical reasoning tasks and…
Taxonomy
Topics: Statistics Education and Methodologies
