DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization

Haiyang Shen; Hang Yan; Zhongshi Xing; Mugeng Liu; Yue Li; Zhiyang Chen; Yuxiang Wang; Jiuzheng Wang; Yun Ma

arXiv:2505.10989 · cs.AI · February 10, 2026

TL;DR

DRAGON introduces a domain-specific data generation framework and benchmark to improve retrieval-augmented generation (RAG) performance, robustness, and cross-domain generalization in knowledge-intensive tasks.

Contribution

It presents a novel data-construction and synthetic data-generation pipeline tailored for domain-specific RAG, along with DRAGONBench, a comprehensive benchmark for evaluation.

Findings

01. Retrievers trained on DRAGON-generated data show significant performance improvements.

02. The approach enhances cross-domain generalization of RAG systems.

03. Integrating the optimized retrievers improves accuracy across various RAG paradigms.

Abstract

Retrieval-augmented generation (RAG) can substantially enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, all depend on a robust retriever, yet existing retrievers rely heavily on public knowledge and often falter when faced with domain-specific queries. To address these limitations, we introduce DRAGON, a framework that combines a data-construction modeling approach with a scalable synthetic data-generation pipeline, specifically designed to optimize domain-specific retrieval performance and bolster retriever robustness. To evaluate RAG performance in domain-specific settings, we propose DRAGONBench, a benchmark spanning 8 domain-specific document collections across 4 distinct fields and featuring a wide spectrum of query complexities, answerability, and hop numbers. Leveraging DRAGON, we generate a…

Code & Models

Repositories

eachsheep/ragsynth (official)

Videos

DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization · underline

Taxonomy

Topics

Information Retrieval and Search Behavior · Multimodal Machine Learning Applications · Topic Modeling

Methods

Attention Is All You Need · Linear Warmup With Linear Decay · Layer Normalization · Softmax · Attention Dropout · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding