DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization
Haiyang Shen, Hang Yan, Zhongshi Xing, Mugeng Liu, Yue Li, Zhiyang Chen, Yuxiang Wang, Jiuzheng Wang, Yun Ma

TL;DR
DRAGON introduces a domain-specific data generation framework and benchmark to improve retrieval-augmented generation (RAG) performance, robustness, and cross-domain generalization in knowledge-intensive tasks.
Contribution
It presents a novel data-construction and synthetic data-generation pipeline tailored for domain-specific RAG, along with DRAGONBench, a comprehensive benchmark for evaluation.
Findings
Retrievers trained on DRAGON-generated data show significant performance improvements.
The approach enhances cross-domain generalization of RAG systems.
Integrated optimized retrievers improve accuracy across various RAG paradigms.
Abstract
Retrieval-augmented generation (RAG) can substantially enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, all depend on a robust retriever, yet existing retrievers rely heavily on public knowledge and often falter when faced with domain-specific queries. To address these limitations, we introduce DRAGON, a framework that combines a data-construction modeling approach with a scalable synthetic data-generation pipeline, specifically designed to optimize domain-specific retrieval performance and bolster retriever robustness. To evaluate RAG performance in domain-specific settings, we propose DRAGONBench, a benchmark spanning 8 domain-specific document collections across 4 distinct fields and featuring a wide spectrum of query complexities, answerability, and hop numbers. Leveraging DRAGON, we generate a…
Taxonomy
Topics: Information Retrieval and Search Behavior · Multimodal Machine Learning Applications · Topic Modeling
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Layer Normalization · Softmax · Attention Dropout · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding
