MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, Kai Chen

TL;DR
MathBench is a comprehensive, multi-disciplinary benchmark designed to evaluate large language models' mathematical understanding and problem-solving skills across various levels, from basic arithmetic to college mathematics.
Contribution
Introduces MathBench, a detailed, multi-stage benchmark that assesses both theoretical knowledge and practical application of mathematics in LLMs.
Findings
MathBench covers five educational stages from arithmetic to college math.
Models demonstrate varying proficiency across different mathematical disciplines.
The benchmark provides a nuanced evaluation of LLMs' mathematical capabilities.
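As a rough illustration of what a breakdown by stage and question type could look like, the sketch below groups graded answers and reports accuracy per group. It is not the authors' evaluation code: the record fields (`stage`, `kind`, `correct`), the function name `accuracy_by_group`, and the sample records are all assumptions made up for this example, and only the two stage names mentioned in the summary are used.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(stage, kind): accuracy} for a list of graded records.

    Each record is a dict with hypothetical keys:
      stage   -- educational stage label, e.g. "arithmetic" or "college"
      kind    -- "theory" or "application"
      correct -- True if the model's answer was graded correct
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = (record["stage"], record["kind"])
        totals[key] += 1
        correct[key] += int(record["correct"])
    return {key: correct[key] / totals[key] for key in totals}


# Made-up records for illustration only; these are not MathBench results.
records = [
    {"stage": "arithmetic", "kind": "theory", "correct": True},
    {"stage": "arithmetic", "kind": "application", "correct": True},
    {"stage": "arithmetic", "kind": "application", "correct": False},
    {"stage": "college", "kind": "theory", "correct": True},
    {"stage": "college", "kind": "application", "correct": False},
]

scores = accuracy_by_group(records)
for (stage, kind), acc in sorted(scores.items()):
    print(f"{stage:<12}{kind:<13}{acc:.2f}")
```

Keeping theory and application scores in separate groups, rather than averaging them into one number, is what lets a reader see where a model knows a concept but fails to apply it.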
Abstract
Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short in providing a holistic assessment of the LLMs' math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model's mathematical proficiency and its ability to apply…
Taxonomy
Topics: Open Education and E-Learning · Mathematics, Computing, and Information Processing
