SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains
Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He

TL;DR
SimRAG is a self-training method that enhances large language models' ability to perform domain-specific question answering by generating and filtering synthetic questions from unlabeled data, leading to improved performance in specialized fields.
Contribution
Introduces SimRAG, a novel self-training approach that enables LLMs to adapt to specialized domains using question generation and filtering without extensive labeled data.
Findings
Outperforms baselines by 1.2% to 8.6% on 11 datasets
Effective in science and medicine domains
Utilizes synthetic question generation for domain adaptation
Abstract
Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve its performance on…
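The generate-then-filter step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the filtering criterion, so the rule used here (keep a synthetic pair only if its answer appears in one of the top-k passages retrieved for the question) is an assumption. The names `overlap_retrieve`, `stub_generate_qa`, and `build_synthetic_set` are hypothetical, the retriever is a toy token-overlap ranker, and the generator is a canned stand-in for prompting the fine-tuned LLM. The initial fine-tuning stage is not shown.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def overlap_retrieve(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by shared lowercase tokens."""
    q_tokens = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q_tokens & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def stub_generate_qa(passage: str) -> List[QAPair]:
    """Hypothetical stand-in for prompting the LLM to write QA pairs."""
    canned = {
        "Aspirin inhibits the enzyme cyclooxygenase.": [
            ("Which enzyme does aspirin inhibit?", "cyclooxygenase"),
            ("What color is the sky?", "blue"),  # off-topic, should be dropped
        ],
        "Insulin is produced by beta cells in the pancreas.": [
            ("Where is insulin produced?", "pancreas"),
        ],
    }
    return canned.get(passage, [])


def build_synthetic_set(
    corpus: List[str],
    generate_qa: Callable[[str], List[QAPair]],
    retrieve: Callable[[str, List[str], int], List[str]] = overlap_retrieve,
    k: int = 2,
) -> List[QAPair]:
    """Generate QA pairs per passage, then apply the assumed filter."""
    kept: List[QAPair] = []
    for passage in corpus:
        for question, answer in generate_qa(passage):
            retrieved = retrieve(question, corpus, k)
            # Assumed filter: the answer must be recoverable from retrieval.
            if any(answer.lower() in p.lower() for p in retrieved):
                kept.append((question, answer))
    return kept


corpus = [
    "Aspirin inhibits the enzyme cyclooxygenase.",
    "Insulin is produced by beta cells in the pancreas.",
    "Mitochondria generate most of the cell's ATP.",
]
kept = build_synthetic_set(corpus, stub_generate_qa)
print(kept)
```

In this toy run the off-topic pair is discarded because its answer does not occur in any retrieved passage. The surviving pairs would then serve as additional training examples for the same model.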
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Adam · Linear Layer · Dropout · Byte Pair Encoding · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · Attention Is All You Need · Dense Connections
