MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data
Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, Kai Zou

TL;DR
This paper introduces the MathOdyssey dataset to benchmark large language models' mathematical problem-solving skills, revealing current limitations and guiding future improvements in complex reasoning abilities.
Contribution
The paper presents a new diverse dataset for evaluating LLMs on advanced math problems and provides benchmarking results highlighting current model capabilities and challenges.
Findings
LLMs perform well on routine tasks but struggle with Olympiad-level problems.
A performance gap exists between open-source and closed-source models.
Significant challenges remain for models on complex university-level questions.
Abstract
Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and…
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Attention Dropout · Dropout · Adam · Linear Layer · Dense Connections
