Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Zihao Zhou; Shudong Liu; Maizhen Ning; Wei Liu; Jindong Wang; Derek F. Wong; Xiaowei Huang; Qiufeng Wang; Kaizhu Huang

arXiv:2407.08733·cs.CL·October 10, 2024

TL;DR

This paper introduces MathCheck, a comprehensive checklist and evaluation tool for assessing the genuine mathematical reasoning and robustness of large language models across diverse tasks, revealing significant performance gaps.

Contribution

The paper presents MathCheck, a novel, versatile checklist and automatic generation tool for evaluating mathematical reasoning and robustness in language models, improving upon traditional benchmarks.

Findings

1. Frontier models like GPT-4o perform well on MathCheck.

2. Many models show significant decline in reasoning robustness.

3. MathCheck better reflects true mathematical abilities than traditional benchmarks.

Abstract

Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and even reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks predominantly concentrate on problem-solving capabilities, which presents a substantial risk of model overfitting and fails to accurately measure genuine mathematical reasoning abilities. In this paper, we argue that if a model really understands a problem, it should be able to apply that understanding robustly across a diverse array of tasks. To this end, we introduce MathCheck, a well-designed checklist for testing task generalization and reasoning robustness, as well as an automatic tool to generate checklists efficiently. MathCheck includes multiple mathematical reasoning tasks and…
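
As a rough illustration of the checklist idea in the abstract (one problem probed across several tasks and several robustness variants), the sketch below scores a small grid of results. It is not the paper's tool or metric: the record layout, the task and variant names (`task_a`, `rewritten`, and so on), and the helper functions `cell_accuracy` and `consistent_fraction` are all hypothetical placeholders, and the data is made up for the example.

```python
from collections import defaultdict

# Hypothetical records: (seed_problem_id, task, variant, model_was_correct).
# Task and variant names are placeholders, not the paper's actual categories.
results = [
    ("p1", "task_a", "original", True),
    ("p1", "task_a", "rewritten", True),
    ("p1", "task_b", "original", True),
    ("p1", "task_b", "rewritten", False),
    ("p2", "task_a", "original", True),
    ("p2", "task_a", "rewritten", True),
    ("p2", "task_b", "original", True),
    ("p2", "task_b", "rewritten", True),
]


def cell_accuracy(records):
    """Accuracy for each (task, variant) cell of the checklist grid."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, task, variant, correct in records:
        totals[(task, variant)] += 1
        hits[(task, variant)] += int(correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}


def consistent_fraction(records):
    """Fraction of seed problems answered correctly in every cell."""
    by_problem = defaultdict(list)
    for pid, _, _, correct in records:
        by_problem[pid].append(correct)
    return sum(all(v) for v in by_problem.values()) / len(by_problem)


grid = cell_accuracy(results)
print(grid[("task_b", "rewritten")])  # 0.5
print(consistent_fraction(results))   # 0.5
```

Reporting per-cell accuracy alongside an all-cells-correct rate is one way to separate "solves the original problem" from "handles the problem robustly across tasks"; the actual scoring used by MathCheck may differ.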

Code & Models

Datasets

PremiLab-Math/MathCheck
dataset · 66 dl

Taxonomy

Topics: Statistics Education and Methodologies