MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
Joseph J. Peper, Wenzhao Qiu, Ali Payani, Lu Wang

TL;DR
MDBench is a synthetic multi-document reasoning benchmark created using knowledge-guided generation, providing a controllable and challenging evaluation for large language models in multi-document reasoning tasks.
Contribution
The paper introduces MDBench, a novel synthetic dataset for multi-document reasoning evaluation, generated through a knowledge-guided process that enables targeted and efficient benchmark creation.
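The idea of generating from structured knowledge can be illustrated with a toy sketch. This is not the authors' pipeline: the table, the field names, the sentence template, and the `MDExample` type are all hypothetical, and a fixed template stands in for whatever LLM-based rewriting the real process uses. It only shows the general shape: each table row becomes its own short document, and the gold answer is computed from the table, so answering requires combining facts across documents.

```python
from dataclasses import dataclass


@dataclass
class MDExample:
    documents: list[str]  # one short document per table row
    question: str
    answer: str           # derived from the table, not from the documents


def row_to_document(row: dict) -> str:
    # Hypothetical template; stands in for an LLM rewriting step (assumption).
    return f"{row['name']} works in {row['team']} and earns {row['salary']} per year."


def build_example(table: list[dict]) -> MDExample:
    # Spread the table across documents so no single document holds the answer.
    documents = [row_to_document(row) for row in table]
    # Compute the gold answer directly from the structured source.
    top = max(table, key=lambda row: row["salary"])
    return MDExample(documents, "Who has the highest salary?", top["name"])


# Toy table, invented purely for illustration.
TABLE = [
    {"name": "Ana", "team": "Sales", "salary": 50000},
    {"name": "Bea", "team": "Research", "salary": 72000},
    {"name": "Cal", "team": "Support", "salary": 41000},
]

example = build_example(TABLE)
```

Because the answer comes from the table rather than from the generated text, the same table could in principle be reused to pose other questions (numeric, multi-hop) without re-annotating the documents.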
Findings
MDBench challenges current LLMs even with short document sets.
Knowledge-guided generation allows targeted analysis of reasoning capabilities.
The benchmark can be quickly adapted for new challenges.
Abstract
Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- Interesting findings comparing different LLMs' capabilities for multi-document-based against table-based reasoning.
- Interesting idea for automatically generating a multi-document QA dataset using single tables as proxy.
- The proposed benchmark showcases the difficulty of frontier models to address challenges associated with the multi-document QA setting, and it may be useful for motivating further work in this space.
- The paper does not properly relate itself to prior work within the domain of multi-document, synthetic dataset generation. What are the main aspects that differentiate it from Schnitzler et al. (2024) and Sprague et al. (2023)?
- The quality guarantees are not fully convincing due to absence of statistics that correlate the human judgements with the automatically-validated data instances.
- Details about the human evaluation process for the 300 manually annotated cases are missing from the m
1. This paper introduces a benchmark generation pipeline for multi-document settings. The pipeline seems reasonable and useful.
2. The authors focus on diverse multi-document settings in the experiments, such as Multi-hop Reasoning and Numeric Reasoning.
3. The curated benchmark is challenging. Many SOTA LLMs perform unsatisfactorily.
1. The authors fail to consider other reasoning baseline methods in their benchmark. The authors should explore and evaluate other, more closely related inference baselines in MD settings, such as the related works mentioned in Section 2 of the paper.
2. Also, there are no comparisons with other data synthesis methods, such as the related works in Section 2.
3. The authors have not provided further convincing analysis on why LLMs fail to complete such tasks, or how to improve their performance.
4. The
- I think that addressing multi-document capabilities of LLMs is a timely question with real-world implications as many tasks are phrased over collections of documents, and there aren't many benchmarks specifically geared to test multi-document performance.
- I appreciate that there’s an explicit description of the desired goals for the benchmark (Section 3.1), and I'm generally on board with the goals.
- My main concern with this paper is the quality of the dataset, and how much it represents multi-document tasks. Real-world multi-document corpora contain several unique challenges - the context is long, there are many documents, the narrative may be contradicting, complementing, repeating, without a clear sense of order. While one of the stated goals of the dataset is to be "Grounded in Real-World Scenarios", the paper didn't try to prove that MDBench actually contains any of these challenges.
1. The idea of using structured data to generate documents for multi-document reasoning is interesting.
2. The authors tested the benchmark data on a variety of LLMs and provided some analysis based on that.
1. The dataset is too small to achieve statistical significance: only 300 examples are human-verified.
2. The authors didn't clarify their quality control process in the main text (there are still 2 pages left).
3. The examples and demonstration process are difficult to read, making the paper unnecessarily hard to follow.
Taxonomy
Topics: Semantic Web and Ontologies
