Benchmarking Large Language Models for Math Reasoning Tasks
Kathrin Seßler, Yao Rong, Emek Gözlüklü, Enkelejda Kasneci

TL;DR
This paper introduces a comprehensive benchmark comparing seven in-context learning algorithms across five mathematical datasets using four large foundation models, highlighting model size effects and prompting strategies for math reasoning.
Contribution
It provides a fair, large-scale benchmark for evaluating LLMs on math reasoning tasks, including an analysis of efficiency, performance trade-offs, and prompt optimization.
Findings
Larger models such as GPT-4o and LLaMA 3-70B solve math reasoning tasks largely independently of the prompting strategy used.
Smaller models' performance heavily depends on in-context learning strategies.
Optimal prompts vary depending on the foundation model used.
Abstract
The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical…
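The comparison grid described in the abstract (every in-context learning algorithm, on every dataset, with every foundation model) can be illustrated with a minimal sketch. Everything below is hypothetical and not taken from the paper: the toy models, the two prompting strategies, the toy dataset, and exact-match scoring are stand-ins chosen so the example runs without any external API. A real benchmark would call actual models and would need proper answer extraction.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only.
Problem = Tuple[str, str]          # (question, gold answer)
Strategy = Callable[[str], str]    # question -> prompt
Model = Callable[[str], str]       # prompt -> raw answer string
GridKey = Tuple[str, str, str]     # (model, strategy, dataset)


def accuracy(model: Model, strategy: Strategy, problems: List[Problem]) -> float:
    """Exact-match accuracy (an assumption; real scoring is usually more lenient)."""
    if not problems:
        return 0.0
    correct = sum(
        model(strategy(question)).strip() == gold.strip()
        for question, gold in problems
    )
    return correct / len(problems)


def run_grid(
    models: Dict[str, Model],
    strategies: Dict[str, Strategy],
    datasets: Dict[str, List[Problem]],
) -> Dict[GridKey, float]:
    """Evaluate every (model, strategy, dataset) combination."""
    results: Dict[GridKey, float] = {}
    for (m_name, model), (s_name, strategy), (d_name, problems) in product(
        models.items(), strategies.items(), datasets.items()
    ):
        results[(m_name, s_name, d_name)] = accuracy(model, strategy, problems)
    return results


def best_strategy_per_model(
    results: Dict[GridKey, float],
) -> Dict[str, Tuple[str, float]]:
    """Pick, for each model, the strategy with the highest mean accuracy."""
    per_pair: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for (m_name, s_name, _d_name), acc in results.items():
        per_pair[(m_name, s_name)].append(acc)
    best: Dict[str, Tuple[str, float]] = {}
    for (m_name, s_name), accs in per_pair.items():
        mean = sum(accs) / len(accs)
        if m_name not in best or mean > best[m_name][1]:
            best[m_name] = (s_name, mean)
    return best


# Toy stand-ins (hypothetical): one model ignores the prompt wording,
# the other only produces a useful answer when asked to reason step by step.
def constant_model(prompt: str) -> str:
    return "4"


def prompt_sensitive_model(prompt: str) -> str:
    return "3" if "step by step" in prompt else "0"


MODELS = {
    "constant_model": constant_model,
    "prompt_sensitive_model": prompt_sensitive_model,
}
STRATEGIES = {
    "zero_shot": lambda q: q,
    "cot": lambda q: q + "\nLet's think step by step.",
}
DATASETS = {"toy": [("2+2", "4"), ("1+2", "3")]}

RESULTS = run_grid(MODELS, STRATEGIES, DATASETS)
BEST = best_strategy_per_model(RESULTS)
```

Keying the results by (model, strategy, dataset) keeps the raw grid available, so the same table can be averaged per model, per strategy, or per dataset, depending on which comparison is of interest.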
Taxonomy
Topics: Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling
Methods: LLaMA
