Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection
Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen

TL;DR
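Sparse RAG introduces sparse context selection and decoding for retrieval-augmented generation: retrieved documents are encoded in parallel, and the model attends only to the most relevant ones when generating. This reduces latency and computational cost while maintaining output quality on both short- and long-form tasks.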
Contribution
The paper presents Sparse RAG, a novel approach that encodes retrieved documents in parallel and selectively attends to relevant caches, improving efficiency and relevance in RAG systems.
Findings
Reduces inference latency by encoding retrieved documents in parallel, which removes the cost of long-range attention across documents.
Maintains high generation quality while decreasing computational costs.
Demonstrates effectiveness across short- and long-form tasks.
Abstract
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly in the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Specifically, Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents. Then, LLMs selectively decode the output by only attending to highly relevant caches auto-regressively, which are chosen via prompting LLMs with special control tokens. It is notable that Sparse RAG combines the assessment of each individual document and the generation of the response into a single process. The designed sparse mechanism in a RAG system can facilitate…
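The two steps described in the abstract (encode each document independently, then decode while attending only to the relevant caches) can be illustrated with a toy sketch. This is not the authors' implementation: `DocCache`, `toy_relevance`, `encode_documents`, `select_caches`, and the `threshold` value are hypothetical names chosen for illustration, and a simple word-overlap score stands in for the relevance rating that the paper obtains by prompting the LLM with special control tokens.

```python
from dataclasses import dataclass


@dataclass
class DocCache:
    doc_id: int
    tokens: list      # stand-in for the document's per-token key/value cache
    relevance: float  # stand-in for the rating obtained via control tokens


def toy_relevance(question, doc):
    """Hypothetical scorer: fraction of question words found in the document."""
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def encode_documents(question, docs):
    """Encode each document on its own, so no document attends to another."""
    return [
        DocCache(i, doc.split(), toy_relevance(question, doc))
        for i, doc in enumerate(docs)
    ]


def select_caches(caches, threshold):
    """Keep only the caches whose relevance reaches the threshold."""
    return [c for c in caches if c.relevance >= threshold]


question = "who wrote hamlet"
docs = [
    "hamlet is a tragedy that william shakespeare wrote",
    "the eiffel tower is in paris",
    "shakespeare also wrote macbeth",
]

caches = encode_documents(question, docs)
kept = select_caches(caches, threshold=0.5)

# Context length the decoder would attend to, with and without selection.
dense_len = sum(len(c.tokens) for c in caches)
sparse_len = sum(len(c.tokens) for c in kept)
print([c.doc_id for c in kept], dense_len, sparse_len)
```

In this toy setting only the first document passes the threshold, so decoding would attend to 8 cached tokens instead of 18. The saving in a real system depends on how many documents the model rates as relevant.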
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout · Linear Layer · Byte Pair Encoding · Adam · Residual Connection
