U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Konstantin Chernyshev; Vitaliy Polshkov; Ekaterina Artemova; Alex Myasnikov; Vlad Stepanov; Alexei Miasnikov; Sergei Tilga

arXiv:2412.03205·cs.CL·February 3, 2026

TL;DR

U-MATH introduces a comprehensive university-level benchmark with diverse, open-ended problems across multiple subjects, highlighting current LLMs' limitations especially in multimodal reasoning and solution judgment.

Contribution

The paper presents U-MATH, a new large-scale, diverse benchmark for evaluating LLMs on university-level mathematics, including multimodal problems and a solution judgment dataset.

Findings

1. The best LLM reaches 93.1% accuracy on text-only tasks.

2. Multimodal reasoning remains challenging: accuracy on visual tasks peaks at 58.5%.

3. Solution judgment peaks at a 90.1% F1-score, so even the best LLM judge is imperfect.

Abstract

The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of multimodal problems. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release µ-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning, with maximum accuracy reaching 93.1% on textual tasks but only 58.5% on visual…
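As a rough illustration of how an LLM judge can be scored against gold correctness labels, the following minimal sketch computes a per-class F1 and its macro average. Everything here is an assumption for illustration: the function names, the boolean label encoding, and the toy verdicts are hypothetical, and this summary does not state which F1 averaging the paper uses.

```python
def f1_score(gold, predicted, positive=True):
    """F1 for one class, computed from raw counts of judge verdicts."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of the F1 for 'correct' and 'incorrect' verdicts."""
    return (f1_score(gold, predicted, True) + f1_score(gold, predicted, False)) / 2


# Hypothetical toy data: True means "the solution is correct".
gold_labels = [True, True, False, False, True, False]
judge_verdicts = [True, False, False, True, True, False]

print(f"macro F1 = {macro_f1(gold_labels, judge_verdicts):.3f}")
```

Averaging over both classes keeps a judge that accepts every solution from scoring well when most solutions in the evaluation set happen to be correct.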

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The paper is well-organized, providing a clear outline of the datasets, experimental setup, and evaluation metrics. The authors explain each component in a structured manner, making it accessible to readers.
2. The datasets include a range of mathematical subjects and problem types, which reflects an effort to cover diverse aspects of mathematical reasoning, though the depth and breadth could still be improved.
3. The introduction of U-MATH and µ-MATH provides additional benchmarks for eval

Weaknesses

1. Although the U-MATH dataset consists of 1,125 samples and covers six subjects, the sample size is still too small. Evaluating the mathematical abilities of large models on such a limited amount of data is not sufficiently convincing.
2. Although the 340 samples in the µ-MATH dataset have been carefully selected to provide a challenging test, a larger sample size could enhance the representativeness of the evaluation, especially across different topics and problem types.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

S1: The inclusion of university-level problems offers a significant advancement over existing datasets that mainly focus on elementary or high school-level tasks.
S2: By integrating visual tasks alongside traditional textual ones, the dataset challenges LLMs to interpret and reason across multimodal formats.
S3: µ-MATH introduces a novel approach to evaluate LLMs' ability to assess solutions, addressing biases and limitations in current evaluation practices.

Weaknesses

W1: The reliance on LLMs as judges (e.g., GPT-4o) to evaluate free-form answers could introduce biases and inconsistencies, particularly since LLMs may struggle with complex derivations or nuanced interpretations of mathematical expressions.
W2: The µ-MATH set includes LLM-generated solutions, which may limit the diversity and challenge of evaluation due to inherent model tendencies or training biases. This could result in less rigorous meta-evaluation as models may overfit to known patterns or

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. U-MATH Benchmark: This is a publicly available dataset of university-level math problems, covering six topics: Pre-Calculus, Algebra, Differential Calculus, Integral Calculus, Multivariable Calculus, and Sequences & Series. A unique aspect of this dataset is its inclusion of open-ended questions that require LLMs to perform multi-step reasoning.
2. µ-MATH Meta-Evaluation Benchmark: This benchmark is specifically designed to test LLMs’ ability to assess the correctness of mathematical solutions.

Weaknesses

1. The U-MATH dataset introduced in the paper supplements the current math datasets by addressing college-level gaps, while the µ-MATH meta-evaluation dataset enables assessment of large models’ ability to evaluate college-level math solutions. However, aside from knowing that this training set focuses on university mathematics and includes six subjects, we lack information about the dataset’s question diversity, difficulty, reasoning steps required to solve the problems, and other aspects. Addit

Code & Models

Repositories

toloka/u-math (Official)

Datasets

Videos

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Taxonomy

Topics: Open Education and E-Learning

MethodsFocus