MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Lei Wang; Shan Dong; Yuhui Xu; Hanze Dong; Yalu Wang; Amrita Saha; Ee-Peng Lim; Caiming Xiong; Doyen Sahoo

arXiv:2410.04698·cs.CL·October 8, 2024

Open Access

TL;DR

MathHay is an automated benchmark for evaluating the long-context mathematical reasoning abilities of large language models. Even top models struggle on it, which points to substantial room for improvement.

Contribution

The paper introduces MathHay, a benchmark that targets long-context mathematical reasoning in LLMs, a capability that existing long-context benchmarks do not evaluate.

Findings

01. Top models achieve only around 51% accuracy on MathHay.

02. Even the best-performing model struggles with mathematical reasoning over long contexts.

03. The results leave significant room for improvement in LLMs' long-context mathematical reasoning.

Abstract

Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks like Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay demands models with both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing…

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Scientific Computing and Data Management