RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang

TL;DR
RAG-Instruct introduces a general method for generating diverse, high-quality retrieval-augmented instructions from any source corpus, improving LLMs' ability to handle a range of RAG scenarios and tasks.
Contribution
It presents a general approach to synthesize a large, diverse RAG instruction dataset from Wikipedia, enhancing LLMs' retrieval-augmented capabilities across multiple scenarios.
Findings
Constructed a 40K instruction dataset covering diverse RAG tasks
Achieved strong zero-shot performance on RAG benchmarks
Outperformed existing RAG baselines significantly
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only a limited range of RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that…
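The two ingredients named in the abstract, a query-document paradigm and an exemplar instruction borrowed from an existing instruction dataset, might be combined into a single synthesis prompt roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not name the five paradigms, so the `paradigm_1` to `paradigm_5` labels are placeholders, and `build_request`, `SynthesisRequest`, the prompt wording, and the toy corpus are all hypothetical.

```python
import random
from dataclasses import dataclass

# Placeholder labels only: the abstract says there are five RAG paradigms
# but does not define them, so these are NOT the paper's definitions.
PARADIGMS = [f"paradigm_{i}" for i in range(1, 6)]


@dataclass
class SynthesisRequest:
    paradigm: str          # which query-document relationship to target
    documents: list[str]   # passages sampled from the source corpus
    exemplar: str          # instruction whose style should be simulated
    prompt: str            # text that would be sent to a generator LLM


def build_request(corpus, exemplars, rng, docs_per_request=2):
    """Assemble one prompt asking a generator LLM for a new RAG instruction."""
    paradigm = rng.choice(PARADIGMS)
    documents = rng.sample(corpus, k=min(docs_per_request, len(corpus)))
    exemplar = rng.choice(exemplars)
    doc_block = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    prompt = (
        f"Query-document relationship: {paradigm}\n"
        f"Documents:\n{doc_block}\n"
        f"Write a new instruction in the style of: {exemplar}"
    )
    return SynthesisRequest(paradigm, documents, exemplar, prompt)


# Toy inputs standing in for Wikipedia passages and an instruction dataset.
corpus = ["Passage A from the source corpus.", "Passage B.", "Passage C."]
exemplars = ["Summarize the passage in one sentence.", "Which year did X happen?"]
request = build_request(corpus, exemplars, random.Random(0))
print(request.prompt)
```

Sampling the paradigm and the exemplar independently is one simple way to spread generated instructions across both scenario types and task styles; the actual sampling strategy used for the 40K dataset is not described in the abstract.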
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Handwritten Text Recognition Techniques
Methods: Attention Is All You Need · Layer Normalization · Byte Pair Encoding · Dense Connections · Attention Dropout · WordPiece · Dropout · Linear Layer · Softmax
