Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps

Jie Ou; Jinyu Guo; Shuaihong Jiang; Zhaokun Wang; Libo Qin; Shunyu Yao; Wenhong Tian

arXiv:2505.12731·cs.AI·May 27, 2025

Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps

Jie Ou, Jinyu Guo, Shuaihong Jiang, Zhaokun Wang, Libo Qin, Shunyu Yao, Wenhong Tian

PDF

Open Access

TL;DR

This paper proposes a model-agnostic method to accelerate retrieval-augmented generation by reducing redundant content processing through instruction-driven representation reduction, significantly improving efficiency without sacrificing quality.

Contribution

It introduces a novel, instruction-driven approach to minimize retrieval overlap redundancy in A-RAG, enhancing efficiency in large language model applications.

Findings

01

Achieves 2.79x acceleration in prefilling

02

Achieves 2.33x acceleration in decoding

03

Maintains equivalent generation quality

Abstract

Retrieval-augmented generation (RAG) has emerged as a pivotal method for expanding the knowledge of large language models. To handle complex queries more effectively, researchers developed Adaptive-RAG (A-RAG) to enhance the generated quality through multiple interactions with external knowledge bases. Despite its effectiveness, A-RAG exacerbates the pre-existing efficiency challenges inherent in RAG, which are attributable to its reliance on multiple iterations of generation. Existing A-RAG approaches process all retrieved contents from scratch. However, they ignore the situation where there is a significant overlap in the content of the retrieval results across rounds. The overlapping content is redundantly represented, which leads to a large proportion of repeated computations, thus affecting the overall efficiency. To address this issue, this paper introduces a model-agnostic…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior

MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Linear Warmup With Linear Decay · Layer Normalization · Softmax · Attention Dropout · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding