UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
Bo Yang, Qingping Yang, Yingwei Ma, Runtao Liu

TL;DR
UTMath introduces a comprehensive benchmark with extensive unit tests across multiple mathematical domains to evaluate LLMs' reasoning and generality, along with a reasoning-to-coding approach to improve performance.
Contribution
The paper presents the UTMath benchmark and the Reasoning-to-Coding Thoughts (RCoT) method, offering a new framework for assessing and enhancing LLMs' mathematical reasoning capabilities.
Findings
The best-performing model, o1-mini, solves only 32.57% of the problems
UTMath contains 1,053 problems across nine domains
RCoT improves reasoning and solution quality
Abstract
The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules that hinder accurate assessments of reasoning and generality. This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess LLMs through extensive unit tests, with a focus on both the accuracy and generality of model responses. It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem. UTMath is highly challenging, with the best-performing model, o1-mini, solving only 32.57% of the problems, followed by o1-preview at…
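The unit-test style of evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the UTMath harness: the function names, the toy problems, and the rule that a problem counts as solved only when every test case passes are assumptions made for illustration. The sketch shows why many test cases per problem reward a general solution over one that only reproduces a few specific values.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a test case is an (input, expected_output) pair.
TestCase = Tuple[int, int]
Solution = Callable[[int], int]


def passes_all_tests(solution: Solution, tests: List[TestCase]) -> bool:
    """Return True only if the solution matches every test case.

    Assumption: a runtime error on any input counts as a failure.
    """
    for arg, expected in tests:
        try:
            if solution(arg) != expected:
                return False
        except Exception:
            return False
    return True


def pass_rate(solutions: Dict[str, Solution],
              benchmark: Dict[str, List[TestCase]]) -> float:
    """Fraction of benchmark problems whose solution passes all its tests."""
    solved = sum(
        1
        for name, tests in benchmark.items()
        if name in solutions and passes_all_tests(solutions[name], tests)
    )
    return solved / len(benchmark)


# Toy problem: the n-th triangular number, tested on small and large inputs.
TRIANGULAR_TESTS: List[TestCase] = [(1, 1), (2, 3), (3, 6), (10, 55), (100, 5050)]
SQUARE_TESTS: List[TestCase] = [(1, 1), (2, 4), (7, 49)]


def general_triangular(n: int) -> int:
    # Closed form, valid for every n.
    return n * (n + 1) // 2


def memorized_triangular(n: int) -> int:
    # Only knows the first three terms; raises KeyError beyond them.
    return {1: 1, 2: 3, 3: 6}[n]


def wrong_square(n: int) -> int:
    # Deliberately incorrect: doubles instead of squaring.
    return 2 * n


toy_benchmark = {"triangular": TRIANGULAR_TESTS, "square": SQUARE_TESTS}
toy_solutions = {"triangular": general_triangular, "square": wrong_square}
```

In this toy setup the memorized solution agrees with the first three test cases but fails on the larger inputs, so it does not count as solved, which is the kind of generality check that a single fixed-number question cannot provide.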
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus
