SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Ran Xu; Hui Liu; Sreyashi Nag; Zhenwei Dai; Yaochen Xie; Xianfeng Tang; Chen Luo; Yang Li; Joyce C. Ho; Carl Yang; Qi He

arXiv:2410.17952 · cs.CL · January 28, 2025

TL;DR

SimRAG is a self-training method that enhances large language models' ability to perform domain-specific question answering by generating and filtering synthetic questions from unlabeled data, leading to improved performance in specialized fields.

Contribution

Introduces SimRAG, a novel self-training approach that enables LLMs to adapt to specialized domains using question generation and filtering without extensive labeled data.

Findings

1. Outperforms baselines by 1.2% to 8.6% on 11 datasets
2. Effective in the science and medicine domains
3. Uses synthetic question generation for domain adaptation

Abstract

Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve its performance on…
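
The generate-then-filter step described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `QAPair`, `synthesize_examples`, `toy_generate_qa`, and `toy_retrieve` are stand-ins invented for illustration, not the authors' code. The abstract does not say how the filtering works, so the criterion used below (keep a pair only if its answer string appears in one of the top-k retrieved passages) is an assumption. In a real pipeline the generator would be the fine-tuned LLM and the retriever a search index over the domain corpus.

```python
from typing import Callable, List, NamedTuple


class QAPair(NamedTuple):
    question: str
    answer: str
    source: str  # passage the pair was generated from


def synthesize_examples(
    passages: List[str],
    generate_qa: Callable[[str], List[QAPair]],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 3,
) -> List[QAPair]:
    """Generate candidate QA pairs per passage and keep those that pass the filter."""
    kept = []
    for passage in passages:
        for pair in generate_qa(passage):
            retrieved = retrieve(pair.question, top_k)
            # Assumed filter (not specified in the abstract): keep the pair only
            # if its answer text appears in at least one retrieved passage.
            if any(pair.answer.lower() in doc.lower() for doc in retrieved):
                kept.append(pair)
    return kept


# Toy stand-ins so the sketch runs without a model or a search index.
CORPUS = [
    "Aspirin inhibits the enzyme cyclooxygenase.",
    "Insulin is produced by beta cells in the pancreas.",
]


def toy_generate_qa(passage: str) -> List[QAPair]:
    if "Aspirin" in passage:
        return [QAPair("Which enzyme does aspirin inhibit?", "cyclooxygenase", passage)]
    # Deliberately wrong answer, to show the filter rejecting a bad pair.
    return [QAPair("Where is insulin produced?", "the liver", passage)]


def toy_retrieve(question: str, k: int) -> List[str]:
    # Rank passages by word overlap with the question (a crude retriever).
    q_words = set(question.lower().strip("?").split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: -len(q_words & set(doc.lower().strip(".").split())),
    )
    return ranked[:k]


filtered = synthesize_examples(CORPUS, toy_generate_qa, toy_retrieve, top_k=1)
for pair in filtered:
    print(pair.question, "->", pair.answer)
```

In this toy run the aspirin pair survives because its answer occurs in the retrieved passage, while the insulin pair is dropped because "the liver" appears nowhere in the corpus. The surviving pairs would then serve as synthetic training examples for a further round of fine-tuning.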

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Adam · Linear Layer · Dropout · Byte Pair Encoding · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · Attention Is All You Need · Dense Connections