Evaluation of LLMs for mathematical problem solving
Ruonan Wang, Runxi Wang, Yunwen Shen, Chengfeng Wu, Qinglin Zhou, Rohitash Chandra

TL;DR
This study evaluates three large language models on mathematical problem solving across various datasets, analyzing their strengths and weaknesses in reasoning, explanation, and accuracy.
Contribution
It introduces a comprehensive five-dimensional assessment framework for LLMs in math problem solving and provides detailed performance comparisons of GPT-4o, DeepSeek-V3, and Gemini-2.0.
Findings
GPT-4o is the most stable across datasets, especially on high-level questions.
DeepSeek-V3 excels in structured domains like optimization.
Gemini-2.0 has strong linguistic understanding but struggles with multi-step reasoning.
Abstract
Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but their potential to solve mathematical problems remains understudied. In this study, we compare three prominent LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0) on three mathematics datasets of varying complexity (GSM8K, MATH500, and MIT Open Courseware). We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework to assess final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results show that GPT-4o is the most stable and consistent performer across all datasets, and that it performs particularly well on the high-level questions of the MIT Open Courseware dataset. DeepSeek-V3 is competitively strong in well-structured domains such as optimisation, but…
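As a rough illustration of how the five dimensions named in the abstract could be recorded and averaged, here is a minimal Python sketch. The class name `StepwiseScore`, the helper `mean_by_dimension`, the 0-to-1 scoring scale, and the sample values are all assumptions made for illustration. The abstract does not specify how the authors score or aggregate each dimension.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class StepwiseScore:
    """Hypothetical rubric for one model response.

    Each field is assumed to be a score in [0, 1]; the paper's
    actual scale is not given in the abstract.
    """
    final_answer_correctness: float
    step_completeness: float
    step_validity: float
    intermediate_calculation_accuracy: float
    problem_comprehension: float


def mean_by_dimension(scores):
    """Average each of the five dimensions over a list of StepwiseScore records."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(StepwiseScore)
    }


# Made-up illustrative values, not results from the paper.
example = [
    StepwiseScore(1.0, 1.0, 1.0, 1.0, 1.0),
    StepwiseScore(0.0, 0.5, 0.5, 0.0, 1.0),
]
summary = mean_by_dimension(example)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Keeping the dimensions as separate fields, rather than collapsing them into one number, makes it possible to see cases where a response understands the problem but fails on intermediate calculations.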
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Text Readability and Simplification
