UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, Can Yang

TL;DR
UGMathBench is a comprehensive, diverse, and dynamic benchmark designed to evaluate undergraduate-level mathematical reasoning in large language models, featuring extensive problem coverage, multiple answer types, and novel metrics for robustness.
Contribution
Introduces UGMathBench, a new benchmark with 5,062 problems across 16 subjects, and proposes effective accuracy and reasoning gap metrics for evaluating LLMs' mathematical reasoning.
Findings
The highest EAcc, 56.3%, is achieved by OpenAI-o1-mini.
Large reasoning gaps are observed across different models.
The benchmark and evaluation code are publicly released.
Abstract
Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or potentially suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated on UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly…
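The two metrics can be illustrated with a minimal Python sketch. Because the abstract is cut off above, the definitions used here are assumptions: EAcc is taken as the fraction of problems answered correctly on every randomized version, and the reasoning gap as the mean per-version accuracy minus EAcc. The function names and the toy data are hypothetical and are not taken from the released evaluation code.

```python
from typing import Sequence


def effective_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Assumed EAcc: share of problems solved on ALL randomized versions.

    results[i][v] is True when version v of problem i was answered correctly.
    """
    if not results:
        raise ValueError("results must be non-empty")
    return sum(all(versions) for versions in results) / len(results)


def average_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-version accuracies (every problem has the same number of versions)."""
    if not results:
        raise ValueError("results must be non-empty")
    n_versions = len(results[0])
    if n_versions == 0 or any(len(versions) != n_versions for versions in results):
        raise ValueError("every problem needs the same, non-zero number of versions")
    per_version = [
        sum(versions[v] for versions in results) / len(results)
        for v in range(n_versions)
    ]
    return sum(per_version) / n_versions


def reasoning_gap(results: Sequence[Sequence[bool]]) -> float:
    """Assumed reasoning gap: average accuracy minus effective accuracy."""
    return average_accuracy(results) - effective_accuracy(results)


# Hypothetical toy data: 4 problems, 3 randomized versions each.
toy_results = [
    [True, True, True],     # solved on every version, counts toward EAcc
    [True, False, True],    # inconsistent across versions
    [False, False, False],  # never solved
    [True, True, False],    # inconsistent across versions
]

eacc = effective_accuracy(toy_results)
gap = reasoning_gap(toy_results)
print(f"EAcc = {eacc:.3f}, reasoning gap = {gap:.3f}")
```

Under these assumptions, a model that answers each problem consistently across versions has a gap of zero, while a model whose correctness depends on the particular randomization shows a positive gap.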
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics, Computing, and Information Processing · Topic Modeling
