DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation

Shaohan Wang; Licheng Zhang; Zheren Fu; Zhendong Mao; Yongdong Zhang

arXiv:2505.10493·cs.CL·October 7, 2025

TL;DR

DACL-RAG introduces a multi-stage training framework combining data augmentation and curriculum learning to improve retrieval-augmented generation, addressing data quality and discriminability issues for better performance.

Contribution

The paper proposes a novel multi-stage training framework with data augmentation and curriculum learning for RAG systems, enhancing training stability and effectiveness.

Findings

01. Achieves 2-4% performance improvements on four QA datasets.

02. Effectively addresses data quality and discriminability challenges in RAG training.

03. Demonstrates consistent gains over existing advanced methods.

Abstract

Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods typically optimize the retriever or the generator in a RAG system by directly using the top-k retrieved documents. However, two key issues inherent in the training data constrain the effectiveness of this training paradigm: (1) across different queries, the top-k retrieved documents vary greatly in content quality, with some providing valuable knowledge while others lack critical information or are even misleading, and training on such data in a purely random manner may impair the generator's ability to extract key information; (2) for a given query, the limited set of k documents often exhibits low discriminability, and training solely on them makes it difficult for the retriever to learn how to distinguish between relevant and irrelevant documents.…
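
To make the contrast with "purely random" training concrete, the following is a minimal sketch of curriculum-style ordering in general, not the authors' implementation. The `Sample` type, the `difficulty` score, and the `curriculum_stages` helper are all hypothetical names introduced for illustration; how the paper actually scores or stages its data is not stated in the text above.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    query: str
    docs: list[str]
    difficulty: float  # hypothetical score: higher means harder (assumption)


def curriculum_stages(samples, num_stages=3, seed=0):
    """Split samples into stages ordered from easy to hard.

    Samples are shuffled only within a stage, so training still sees
    random order locally while difficulty rises from stage to stage.
    """
    if not samples:
        return []
    rng = random.Random(seed)
    ordered = sorted(samples, key=lambda s: s.difficulty)
    size = -(-len(ordered) // num_stages)  # ceiling division
    stages = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    for stage in stages:
        rng.shuffle(stage)
    return stages


if __name__ == "__main__":
    # Toy data: difficulty values are made up for the example.
    data = [Sample(f"q{i}", [f"doc{i}"], d)
            for i, d in enumerate([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])]
    for n, stage in enumerate(curriculum_stages(data)):
        print(n, sorted(s.difficulty for s in stage))
```

A training loop would then iterate over the stages in order instead of sampling uniformly from the whole pool.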

Taxonomy

Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · Linear Layer · Weight Decay