DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents

Yilun Zhao; Yitao Long; Hongjun Liu; Ryo Kamoi; Linyong Nan; Lyuhao Chen; Yixin Liu; Xiangru Tang; Rui Zhang; Arman Cohan

arXiv:2311.09805·cs.CL·August 12, 2024·2 cites

TL;DR

This paper introduces DocMath-Eval, a benchmark for assessing the numerical reasoning abilities of large language models in understanding complex, specialized documents with text and tables, revealing current limitations compared to human experts.

Contribution

The paper presents a new benchmark, DocMath-Eval, and provides an extensive evaluation of 48 LLMs, highlighting their strengths and weaknesses in specialized numerical reasoning tasks.

Findings

1. GPT-4o is the best-performing model among those evaluated.

2. Even the best current LLMs lag significantly behind human experts on complex numerical reasoning problems grounded in long contexts.

3. DocMath-Eval can serve as a benchmark for guiding future model development.

Abstract

Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing specialized documents containing both text and tables. We conduct an extensive evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods, aiming to comprehensively assess the capabilities and limitations of existing LLMs in DocMath-Eval. We found that even the current best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts.…
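The two prompting methods named in the abstract differ in where the final number comes from. With Chain-of-Thought, the answer is read out of the model's written reasoning; with Program-of-Thought, the model writes a program and the answer is whatever that program returns when executed. The Python sketch below illustrates the distinction only. It is not the benchmark's evaluation harness: the function names, the `solution()` convention, the 1% tolerance, and the revenue figures are all assumptions made up for this example.

```python
import math
import re


def run_program_of_thought(program: str) -> float:
    """Execute a model-written program and return the value of its solution().

    The solution() entry point is a convention assumed for this sketch.
    Model output is untrusted, so a real harness would sandbox this call.
    """
    namespace: dict = {}
    exec(program, namespace)
    return float(namespace["solution"]())


def extract_chain_of_thought_answer(response: str) -> float:
    """Treat the last number in a free-text reasoning chain as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        raise ValueError("no numeric answer found in response")
    return float(numbers[-1])


def is_correct(predicted: float, expected: float, rel_tol: float = 1e-2) -> bool:
    """Compare with a relative tolerance (the 1% default is an assumption)."""
    return math.isclose(predicted, expected, rel_tol=rel_tol)


# Hypothetical model outputs for a made-up question:
# "By what percentage did revenue grow from 2022 to 2023?"
pot_output = '''
def solution():
    revenue_2022 = 120.0  # invented figure, not from any benchmark document
    revenue_2023 = 150.0  # invented figure, not from any benchmark document
    return (revenue_2023 - revenue_2022) / revenue_2022 * 100
'''
cot_output = (
    "Revenue rose from 120 to 150, so growth is 30 / 120 = 25 percent. "
    "The answer is 25"
)

pot_answer = run_program_of_thought(pot_output)
cot_answer = extract_chain_of_thought_answer(cot_output)
print(pot_answer, cot_answer, is_correct(pot_answer, cot_answer))
```

In this framing, Program-of-Thought hands the arithmetic to the interpreter, while Chain-of-Thought depends on the model computing each step correctly in text and on the answer parser finding the right number.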

Code & Models

Repositories

yale-nlp/docmath-eval (PyTorch, official)

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Statistics Education and Methodologies · Topic Modeling