Evaluation of LLMs for mathematical problem solving

Ruonan Wang; Runxi Wang; Yunwen Shen; Chengfeng Wu; Qinglin Zhou; Rohitash Chandra

arXiv:2506.00309 · cs.AI · July 1, 2025

Open Access

TL;DR

This study evaluates three large language models on mathematical problem solving across three datasets of varying complexity, analyzing their strengths and weaknesses in reasoning, explanation, and accuracy.

Contribution

The paper introduces a five-dimensional assessment framework for LLMs in mathematical problem solving and provides detailed performance comparisons of GPT-4o, DeepSeek-V3, and Gemini-2.0.

Findings

01. GPT-4o is the most stable across datasets, especially in high-level questions.

02. DeepSeek-V3 excels in structured domains like optimization.

03. Gemini-2.0 has strong linguistic understanding but struggles with multi-step reasoning.

Abstract

Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but their potential to solve mathematical problems remains understudied. In this study, we compare three prominent LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0) on three mathematics datasets of varying complexity: GSM8K, MATH500, and the MIT Open Courseware dataset. We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework to assess final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results show that GPT-4o is the most stable and consistent performer across all datasets, and that it performs particularly well on the high-level questions of the MIT Open Courseware dataset. DeepSeek-V3 is competitively strong in well-structured domains such as optimisation, but…
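
The five assessment dimensions named in the abstract can be pictured as one score record per model response. The sketch below is only an illustration of that idea: the class name `StepwiseScore`, the 0-to-1 scale, and the unweighted per-dimension mean in `average_by_dimension` are all assumptions made here, not the scoring scheme used in the paper.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class StepwiseScore:
    """One model response scored on the five dimensions listed in the abstract.

    The 0.0-1.0 scale is an assumption for illustration only.
    """
    final_answer_correctness: float
    step_completeness: float
    step_validity: float
    intermediate_calculation_accuracy: float
    problem_comprehension: float

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")


def average_by_dimension(scores):
    """Return the unweighted mean of each dimension over a list of scores.

    Equal weighting is a hypothetical choice, not taken from the paper.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(StepwiseScore)
    }
```

Calling `average_by_dimension` on the records for one model and one dataset gives a five-entry dictionary, which keeps the dimensions separate instead of collapsing them into a single accuracy figure.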

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Text Readability and Simplification