MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
Gabrielle Kaili-May Liu, Bowen Shi, Avi Caciularu, Idan Szpektor, Arman Cohan

TL;DR
MDCure is a scalable framework that generates high-quality multi-document instruction data to improve large language models' performance on multi-document tasks without extensive pre-training or human annotation.
Contribution
We introduce MDCure, a novel data generation pipeline and reward model that enhance LLMs' multi-document capabilities efficiently and effectively.
Findings
MDCure improves LLM performance on multi-document benchmarks by up to 75.1%.
The framework enables small open-source models to outperform larger proprietary models.
MDCure's synthetic data generation is cost-effective and compatible with various models and training methods.
Abstract
Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present unique difficulties, including management of inter-document dependencies, redundancy, and incoherent structures. To address this challenge, we introduce MDCure, a scalable and effective instruction data generation framework to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human-annotated data. MDCure generates high-quality synthetic MD instruction data over sets of articles via targeted prompts. We also introduce MDCureRM, a cost-effective, MD-specific reward model to score and filter generated data based on their training utility for MD settings. MDCure is compatible with open- and closed-source models in…
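The generate-then-filter step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Candidate`, `filter_candidates`, `toy_score`, and the threshold value are hypothetical names chosen for this example, and `toy_score` is a trivial word-overlap stand-in for MDCureRM, whose actual scoring interface is not described in this excerpt.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """One synthetic instruction-answer pair grounded in a set of documents."""
    documents: List[str]
    instruction: str
    answer: str


def filter_candidates(
    candidates: Iterable[Candidate],
    score_fn: Callable[[Candidate], float],
    threshold: float,
) -> List[Candidate]:
    """Keep candidates scoring at or above the threshold, best first."""
    scored = [(score_fn(c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept]


def toy_score(candidate: Candidate) -> float:
    """Hypothetical scorer (NOT MDCureRM): the fraction of source
    documents that share at least one word with the answer. It only
    illustrates the idea of rewarding answers that draw on several
    documents."""
    if not candidate.documents:
        return 0.0
    answer_words = set(candidate.answer.lower().split())
    used = sum(
        1 for doc in candidate.documents
        if answer_words & set(doc.lower().split())
    )
    return used / len(candidate.documents)
```

In a real pipeline, `score_fn` would wrap a trained reward model and the candidates would come from prompting a generator LLM over sets of related articles; the filtering logic itself would stay about this simple.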
Code & Models
- 🤗 yale-nlp/MDCureRM (model) · 4 downloads · ♡ 3
- 🤗 yale-nlp/MDCure-FlanT5-Base (model) · 3 downloads
- 🤗 yale-nlp/MDCure-FlanT5-Large (model) · 2 downloads
- 🤗 yale-nlp/MDCure-Qwen2-1.5B-Instruct (model) · 5 downloads
- 🤗 yale-nlp/MDCure-Qwen2-7B-Instruct (model) · 15 downloads · ♡ 1
- 🤗 yale-nlp/MDCure-LLAMA3.1-8B-Instruct (model) · 3 downloads · ♡ 2
- 🤗 yale-nlp/MDCure-LLAMA3.1-70B-Instruct (model) · 3 downloads
Taxonomy
Topics: Mathematics, Computing, and Information Processing
Methods: Entropy Regularization · Proximal Policy Optimization · Balanced Selection
