Benchmarking Large Language Models for Math Reasoning Tasks

Kathrin Seßler; Yao Rong; Emek Gözlüklü; Enkelejda Kasneci

arXiv:2408.10839 · cs.CL · December 20, 2024 · 2 cites

TL;DR

This paper introduces a comprehensive benchmark comparing seven in-context learning algorithms across five mathematical datasets using four large foundation models, highlighting model size effects and prompting strategies for math reasoning.

Contribution

It provides the first fair, large-scale benchmark for evaluating LLMs on math reasoning tasks, including analysis of efficiency, performance trade-offs, and prompt optimization.

Findings

01. Larger models such as GPT-4o and LLaMA 3-70B solve math reasoning problems independently of the prompting strategy.

02. The performance of smaller models depends heavily on the in-context learning strategy.

03. The optimal prompt varies with the foundation model used.

Abstract

The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical…

Code & Models

Repositories

kathrinse/math-reasoning-benchmark (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling

Methods: LLaMA