RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Guanzheng Chen; Qilong Feng; Jinjie Ni; Xin Li; Michael Qizhe Shieh

arXiv:2502.20330·cs.CL·June 24, 2025

TL;DR

RAPID introduces a retrieval-augmented speculative decoding method that significantly improves the efficiency and quality of long-context large language model inference by combining retrieval-augmented generation with speculative decoding techniques.

Contribution

The paper proposes RAPID, a novel approach that integrates retrieval-augmented generation with speculative decoding to enhance long-context inference efficiency and performance.

Findings

01

Achieves over 2x speedup in long-context inference.

02

Improves performance metrics significantly on LLaMA-3.1 and Qwen2.5 models.

03

Demonstrates robustness across various context lengths and retrieval qualities.

Abstract

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG both to accelerate long-context inference and to enhance its generation quality. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG…
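
To make the draft-then-verify idea in the abstract concrete, here is a minimal sketch of the generic speculative-sampling acceptance step, assuming the drafter proposes one token from a shortened retrieved context and the target scores it on the full long context. The function name `speculative_accept` and the toy probability dictionaries are hypothetical, and this is the textbook acceptance rule rather than RAPID's exact procedure, which the truncated abstract does not spell out.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng):
    """One verification step of standard speculative sampling.

    p_target: token -> probability under the target model (full context).
    q_draft:  token -> probability under the drafter (shortened context).
    Both are hypothetical toy inputs; draft_token must have q_draft > 0.
    Returns (token, accepted).
    """
    p = p_target.get(draft_token, 0.0)
    q = q_draft[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


if __name__ == "__main__":
    rng = random.Random(0)
    target = {"a": 0.5, "b": 0.5}   # toy target distribution
    drafter = {"a": 1.0}            # toy drafter, overconfident in "a"
    print(speculative_accept("a", target, drafter, rng))
```

In this toy setup the drafter's token is kept about half the time and replaced by the target-preferred alternative otherwise, so the output distribution still matches the target model.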

Code & Models

Repositories

john-ai-lab/rapid (PyTorch, official)

Taxonomy

Methods: Weight Decay · Attention Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Attention Is All You Need · Linear Warmup With Linear Decay