Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo

TL;DR
MuDABench is a new benchmark for multi-document analytical question answering over large, semi-structured document collections. It emphasizes extensive cross-document reasoning and aggregation. A multi-agent system improves performance on it but still lags behind human experts.
Contribution
The paper introduces MuDABench, a large-scale benchmark for multi-document analytical QA, and proposes a multi-agent workflow to enhance reasoning over extensive document collections.
Findings
Standard RAG systems perform poorly on MuDABench.
The proposed multi-agent workflow improves reasoning and answer accuracy.
A significant gap remains between system performance and that of human experts.
Abstract
This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that…
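The evaluation protocol described in the abstract pairs final answer accuracy with intermediate-fact coverage. The following is a minimal sketch of how such a pair of metrics could be computed, not the authors' implementation. The `Fact` record, the function names `fact_coverage` and `answer_correct`, the matching on `(doc_id, field)`, and the 1% relative tolerance are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical intermediate fact: one value extracted from one document."""
    doc_id: str
    field: str
    value: float


def fact_coverage(gold, predicted, rel_tol=0.01):
    """Fraction of gold facts that the system also extracted.

    A gold fact counts as covered when the prediction contains a fact with
    the same (doc_id, field) and a value within the relative tolerance.
    The matching rule and tolerance are assumptions, not from the paper.
    """
    if not gold:
        return 1.0
    predicted_values = {(f.doc_id, f.field): f.value for f in predicted}
    hits = 0
    for g in gold:
        value = predicted_values.get((g.doc_id, g.field))
        if value is not None and math.isclose(value, g.value, rel_tol=rel_tol):
            hits += 1
    return hits / len(gold)


def answer_correct(gold_answer, predicted_answer, rel_tol=0.01):
    """Final-answer check for a numeric analytical question (assumed rule)."""
    return math.isclose(predicted_answer, gold_answer, rel_tol=rel_tol)


# Toy question: "What is the average revenue across companies A, B, and C?"
gold_facts = [
    Fact("A", "revenue", 100.0),
    Fact("B", "revenue", 200.0),
    Fact("C", "revenue", 300.0),
]
gold_answer = sum(f.value for f in gold_facts) / len(gold_facts)

# The system missed document C, so its aggregate is off.
predicted_facts = [Fact("A", "revenue", 100.0), Fact("B", "revenue", 200.0)]
predicted_answer = sum(f.value for f in predicted_facts) / len(predicted_facts)

coverage = fact_coverage(gold_facts, predicted_facts)
correct = answer_correct(gold_answer, predicted_answer)
```

In this toy case the system recovers two of the three gold facts and produces a wrong average, which shows why coverage can serve as a diagnostic: a low coverage score points to a retrieval or extraction failure rather than an arithmetic one.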
