UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

Xin Xu; Jiaxin Zhang; Tianhao Chen; Zitong Chao; Jishan Hu; Can Yang

arXiv:2501.13766 · cs.CL · February 26, 2025

TL;DR

UGMathBench is a comprehensive, diverse, and dynamic benchmark designed to evaluate undergraduate-level mathematical reasoning in large language models, featuring extensive problem coverage, multiple answer types, and novel metrics for robustness.

Contribution

Introduces UGMathBench, a new benchmark with 5,062 problems across 16 subjects, and proposes effective accuracy and reasoning gap metrics for evaluating LLMs' mathematical reasoning.
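
The two metrics can be illustrated with a minimal sketch. It rests on assumptions not spelled out in the text here: that EAcc counts a problem as solved only when every randomized version of it is answered correctly, and that the reasoning gap is the mean per-version accuracy minus EAcc. The function names and the `results` layout (problem id mapped to one boolean per version) are hypothetical and are not taken from the released evaluation code.

```python
from typing import Dict, List


def effective_accuracy(results: Dict[str, List[bool]]) -> float:
    """Share of problems answered correctly in every version (assumed EAcc definition)."""
    if not results:
        raise ValueError("results must not be empty")
    solved_in_all_versions = sum(all(versions) for versions in results.values())
    return solved_in_all_versions / len(results)


def average_accuracy(results: Dict[str, List[bool]]) -> float:
    """Mean of the per-version accuracies; assumes every problem has the same number of versions."""
    if not results:
        raise ValueError("results must not be empty")
    n_versions = len(next(iter(results.values())))
    per_version = [
        sum(versions[i] for versions in results.values()) / len(results)
        for i in range(n_versions)
    ]
    return sum(per_version) / n_versions


def reasoning_gap(results: Dict[str, List[bool]]) -> float:
    """Assumed gap definition: average accuracy across versions minus EAcc."""
    return average_accuracy(results) - effective_accuracy(results)
```

Under these assumptions, a model that answers a problem correctly in only some of its versions raises the average accuracy but not EAcc, so the gap grows with inconsistency across versions.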

Findings

01. The highest EAcc is 56.3%, achieved by OpenAI-o1-mini.

02. Large reasoning gaps are observed across different models.

03. The benchmark and evaluation code are publicly released.

Abstract

Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or likely suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated on UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly…

Code & Models

Datasets

UGMathBench/ugmathbench
dataset · 342 dl

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics, Computing, and Information Processing · Topic Modeling