UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts

Bo Yang; Qingping Yang; Yingwei Ma; Runtao Liu

arXiv:2411.07240·cs.CL·January 15, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

UTMath introduces a comprehensive benchmark with extensive unit tests across multiple mathematical domains to evaluate LLMs' reasoning and generality, along with a reasoning-to-coding approach to improve performance.

Contribution

The paper presents the UTMath benchmark and the Reasoning-to-Coding Thoughts (RCoT) method, which together offer a framework for assessing and improving LLMs' mathematical reasoning.

Findings

01 The best-performing model, o1-mini, solves only 32.57% of the problems

02 UTMath contains 1,053 problems across nine mathematical domains

03 RCoT improves reasoning and solution quality

Abstract

The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules that hinder accurate assessments of reasoning and generality. This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess LLMs through extensive unit tests, with a focus on both the accuracy and generality of model responses. It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem. UTMath is highly challenging, with the best-performing model, o1-mini, solving only 32.57% of the problems, followed by o1-preview at…
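The unit-test style of evaluation described in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the helper names (`passes_all`, `solve_rate`), the toy problem (triangular numbers), and the rule that a problem counts as solved only when every test case passes are assumptions made for illustration. The only figure taken from the abstract is the average of 68 test cases per problem.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical test-case shape: (input n, expected output).
Case = Tuple[int, int]


def passes_all(solution: Callable[[int], int], cases: Sequence[Case]) -> bool:
    """Return True only if the solution matches every (input, expected) pair.

    Assumption: a problem counts as solved only when all of its cases pass.
    """
    for n, expected in cases:
        try:
            if solution(n) != expected:
                return False
        except Exception:
            # A crash on any case is treated as a failure.
            return False
    return True


def solve_rate(results: Sequence[bool]) -> float:
    """Percentage of entries in `results` that are True."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Toy problem (hypothetical): the n-th triangular number, with 68 cases
# to mirror the average case count quoted in the abstract.
cases = [(n, n * (n + 1) // 2) for n in range(1, 69)]


def general_solution(n: int) -> int:
    # Works for every n, so it generalizes across all cases.
    return n * (n + 1) // 2


def hardcoded_solution(n: int) -> int:
    # Memorizes only the first few terms and fails on later cases.
    return {1: 1, 2: 3, 3: 6}.get(n, 0)


results = [
    passes_all(general_solution, cases),
    passes_all(hardcoded_solution, cases),
]
print(solve_rate(results))  # 50.0
```

Checking many cases per problem is what separates a general solution from one that only reproduces a few specific answers: the hard-coded function above matches the first three cases yet is rejected on the fourth.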

Code & Models

Repositories

utmathgroup/utmath
Official

Datasets

ReasonMind/UTMath
dataset · 181 downloads

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Focus