RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

Wanlong Liu; Junying Chen; Ke Ji; Li Zhou; Wenyu Chen; Benyou Wang

arXiv:2501.00353 · cs.CL · January 3, 2025

TL;DR

RAG-Instruct introduces a versatile method to generate diverse, high-quality retrieval-augmented instructions from any source, significantly improving LLMs' ability to handle various RAG scenarios and tasks.

Contribution

The paper presents a general approach for synthesizing diverse RAG instruction data and applies it to Wikipedia to build a large instruction dataset, enhancing LLMs' retrieval-augmented capabilities across multiple scenarios.

Findings

01 Constructed a 40K instruction dataset covering diverse RAG tasks

02 Achieved strong zero-shot performance on RAG benchmarks

03 Outperformed existing RAG baselines significantly

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only a limited range of RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that…
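
The abstract names two ingredients, RAG paradigms and instruction simulation, without giving the paradigm definitions or the prompt format. The following is only a minimal sketch of how one synthesis request might be assembled from a source corpus and an existing instruction dataset. Everything in it is an assumption made for illustration: the placeholder paradigm labels, the `SynthesisRequest` type, the `build_request` and `render_prompt` helpers, and the prompt wording are hypothetical and are not taken from the paper or its repository.

```python
import random
from dataclasses import dataclass

# Placeholder labels only: the abstract mentions five RAG paradigms
# (query-document relationships) but does not name them here.
PARADIGMS = [f"paradigm_{i}" for i in range(1, 6)]


@dataclass
class SynthesisRequest:
    """One hypothetical request to an instruction-synthesis model."""
    paradigm: str              # which query-document relationship to target
    documents: list[str]       # passages sampled from the source corpus
    exemplar_instruction: str  # instruction drawn from an existing dataset


def build_request(corpus, exemplars, rng, docs_per_request=2):
    """Sample a paradigm label, source documents, and an exemplar instruction."""
    return SynthesisRequest(
        paradigm=rng.choice(PARADIGMS),
        documents=rng.sample(corpus, k=min(docs_per_request, len(corpus))),
        exemplar_instruction=rng.choice(exemplars),
    )


def render_prompt(req):
    """Render the request as a text prompt (the wording is an assumption)."""
    docs = "\n".join(f"[{i}] {d}" for i, d in enumerate(req.documents, start=1))
    return (
        f"Query-document relationship: {req.paradigm}\n"
        f"Source documents:\n{docs}\n"
        f"Exemplar instruction (imitate its style and task type): "
        f"{req.exemplar_instruction}\n"
        "Write a new instruction and answer grounded in the source documents."
    )


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = ["Passage about topic A.", "Passage about topic B.", "Passage about topic C."]
    exemplars = ["Summarize the passage in one sentence."]
    print(render_prompt(build_request(corpus, exemplars, rng)))
```

Sampling the exemplar separately from the documents reflects the idea, stated in the abstract, that instruction diversity is borrowed from existing instruction datasets while the content comes from the source corpus.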

Code & Models

Repositories

freedomintelligence/rag-instruct
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Handwritten Text Recognition Techniques

Methods: Attention Is All You Need · Layer Normalization · Byte Pair Encoding · Dense Connections · Attention Dropout · WordPiece · Dropout · Linear Layer · Softmax