MDCure: A Scalable Pipeline for Multi-Document Instruction-Following

Gabrielle Kaili-May Liu; Bowen Shi; Avi Caciularu; Idan Szpektor; Arman Cohan

arXiv:2410.23463·cs.CL·April 30, 2025

TL;DR

MDCure is a scalable framework that generates high-quality multi-document instruction data to improve large language models' performance on multi-document tasks without extensive pre-training or human annotation.

Contribution

We introduce MDCure, a data generation pipeline, and MDCureRM, an accompanying reward model, which together improve LLMs' multi-document capabilities efficiently and effectively.

Findings

01  MDCure improves LLM performance on multi-document benchmarks by up to 75.1%.

02  The framework enables small open-source models to outperform larger proprietary models.

03  MDCure's synthetic data generation is cost-effective and compatible with various models and training methods.

Abstract

Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present unique difficulties, including management of inter-document dependencies, redundancy, and incoherent structures. To address this challenge, we introduce MDCure, a scalable and effective instruction data generation framework to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human-annotated data. MDCure generates high-quality synthetic MD instruction data over sets of articles via targeted prompts. We also introduce MDCureRM, a cost-effective, MD-specific reward model to score and filter generated data based on their training utility for MD settings. MDCure is compatible with open- and closed-source models in…
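
The score-and-filter step described in the abstract can be illustrated with a minimal sketch. This is not the MDCure or MDCureRM implementation: the `Candidate` class, the `filter_by_reward` helper, the threshold, and the `toy_score` heuristic are all hypothetical names introduced here for illustration. In the actual pipeline the scoring role would be played by the MDCureRM reward model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """One synthetic instruction-answer pair generated over a set of documents."""
    instruction: str
    answer: str


def filter_by_reward(
    candidates: List[Candidate],
    score_fn: Callable[[Candidate], float],
    threshold: float,
    max_keep: Optional[int] = None,
) -> List[Candidate]:
    """Keep candidates scoring at least `threshold`, best first.

    `score_fn` is a stand-in for a reward model (assumption: it returns
    one scalar per candidate, higher meaning more useful for training).
    """
    scored = [(score_fn(c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= threshold]
    # Stable sort: ties keep their original generation order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    if max_keep is not None:
        kept = kept[:max_keep]
    return [c for _, c in kept]


def toy_score(c: Candidate) -> float:
    """Placeholder heuristic, NOT MDCureRM: counts the distinct documents
    (written as 'Doc <n>') that the instruction refers to."""
    return float(len(set(re.findall(r"Doc (\d+)", c.instruction))))


candidates = [
    Candidate("Summarize Doc 1.", "..."),
    Candidate("Compare the claims in Doc 1 and Doc 2.", "..."),
    Candidate("What is the capital of France?", "..."),
]

# Keep only instructions that touch at least two documents.
kept = filter_by_reward(candidates, toy_score, threshold=2.0)
print([c.instruction for c in kept])
```

Separating the scoring function from the filtering logic means the toy heuristic could be swapped for a learned reward model without changing the selection code.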

Code & Models

Repositories

yale-nlp/mdcure (PyTorch, official)

Videos

MDCure: A Scalable Pipeline for Multi-Document Instruction-Following · Underline

Taxonomy

Topics: Mathematics, Computing, and Information Processing

Methods: Entropy Regularization · Proximal Policy Optimization · Balanced Selection