LongDA: Benchmarking LLM Agents for Long-Document Data Analysis

Yiyang Li; Zheyuan Zhang; Tianyi Ma; Zehong Wang; Keerthiram Murugesan; Chuxu Zhang; Yanfang Ye

arXiv:2601.02598 · cs.DL · January 13, 2026

TL;DR

LongDA is a comprehensive benchmark designed to evaluate LLM agents' ability to analyze long, complex documents in real-world data analysis tasks, revealing significant performance gaps among current models.

Contribution

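The paper introduces LongDA, a benchmark and framework for assessing LLM agents on long-document data analysis. It is built from raw data files, documentation, and expert-written publications from 17 publicly available U.S. national surveys and comprises 505 analytical queries.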

Findings

01. State-of-the-art models show substantial performance gaps.

02. LongDA effectively simulates real-world analytical workflows.

03. Evaluation reveals critical challenges for deploying LLMs in decision-making contexts.

Abstract

We introduce LongDA, a data analysis benchmark for evaluating LLM-based agents under documentation-intensive analytical workflows. In contrast to existing benchmarks that assume well-specified schemas and inputs, LongDA targets real-world settings in which navigating long documentation and complex data is the primary bottleneck. To this end, we manually curate raw data files, long and heterogeneous documentation, and expert-written publications from 17 publicly available U.S. national surveys, from which we extract 505 analytical queries grounded in real analytical practice. Solving these queries requires agents to first retrieve and integrate key information from multiple unstructured documents, before performing multi-step computations and writing executable code, which remains challenging for existing data analysis agents. To support the systematic evaluation under this setting, we…
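
The retrieve-then-compute workflow described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the codebook excerpt, the variable names (`HLTHINS`, `WTFINAL`), the records, and the helper functions are hypothetical and are not taken from LongDA or from any real survey. It shows only the general shape of the task: look up the relevant variables in the documentation, then run a weighted computation that respects the documented missing-value code.

```python
import re

# Hypothetical codebook excerpt; real survey documentation is far longer
# and spread across multiple unstructured files.
CODEBOOK = """
VAR: HLTHINS  Label: Has health insurance coverage  Codes: 1=Yes 2=No 9=Unknown
VAR: WTFINAL  Label: Final sample weight
VAR: AGEGRP   Label: Age group  Codes: 1=18-34 2=35-64 3=65+
"""

# Hypothetical respondent records.
ROWS = [
    {"HLTHINS": 1, "WTFINAL": 100.0},
    {"HLTHINS": 2, "WTFINAL": 50.0},
    {"HLTHINS": 1, "WTFINAL": 50.0},
    {"HLTHINS": 9, "WTFINAL": 25.0},
]


def find_variable(codebook, keyword):
    """Return the name of the first variable whose label mentions keyword."""
    for line in codebook.splitlines():
        match = re.match(r"VAR:\s+(\w+)\s+Label:\s+(.*)", line)
        if match and keyword.lower() in match.group(2).lower():
            return match.group(1)
    raise KeyError(keyword)


def weighted_share(rows, var, weight, target, missing=(9,)):
    """Weighted share of rows where var == target, excluding missing codes."""
    valid = [r for r in rows if r[var] not in missing]
    total = sum(r[weight] for r in valid)
    hits = sum(r[weight] for r in valid if r[var] == target)
    return hits / total


# Step 1: retrieve the variable names from the documentation.
var = find_variable(CODEBOOK, "health insurance")
weight = find_variable(CODEBOOK, "sample weight")

# Step 2: compute the weighted estimate using what was retrieved.
share = weighted_share(ROWS, var, weight, target=1)
print(f"{var} weighted share: {share:.3f}")
```

In this toy case the documentation lookup is a single keyword match. The abstract's point is that in real settings this retrieval and integration step, across long and heterogeneous documents, is the primary bottleneck.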

Code & Models

Datasets

EvilBench/LongDA (dataset, 268 downloads)

Taxonomy

Topics: Web Data Mining and Analysis · Scientific Computing and Data Management · Research Data Management Practices