MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Meng Fang; Xiangpeng Wan; Fei Lu; Fei Xing; Kai Zou

arXiv:2406.18321 · cs.CL · June 27, 2024 · 2 cites



TL;DR

This paper introduces the MathOdyssey dataset to benchmark large language models' mathematical problem-solving skills, revealing current limitations and guiding future improvements in complex reasoning abilities.

Contribution

The paper presents a new diverse dataset for evaluating LLMs on advanced math problems and provides benchmarking results highlighting current model capabilities and challenges.

Findings

01. LLMs perform well on routine tasks but struggle with Olympiad-level problems.

02. A performance gap exists between open-source and closed-source models.

03. Significant challenges remain for models on complex university-level questions.

Abstract

Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and…



Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Attention Dropout · Dropout · Adam · Linear Layer · Dense Connections