TL;DR
RAPID (Retrieval-Augmented Speculative Decoding) speeds up long-context LLM inference and improves generation quality by using a RAG drafter, a draft model that reads a shortened retrieval context, to speculate tokens for the long-context target model.
Contribution
The paper proposes RAPID, which integrates retrieval-augmented generation (RAG) with speculative decoding (SD) to improve both the efficiency and the generation quality of long-context inference.
Findings
Achieves over 2x speedup in long-context inference.
Improves generation quality on LLaMA-3.1 and Qwen2.5 models.
Demonstrates robustness across various context lengths and retrieval qualities.
Abstract
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG for both accelerating and enhancing generation quality in long-context inference. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG…
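To make the idea in the abstract concrete, here is a minimal toy sketch of the two ingredients it names: picking a shortened retrieval context for the drafter, and verifying a drafted token against the target model's distribution. It uses the standard speculative sampling accept/reject rule, which may differ from RAPID's actual verification step, since the abstract above is truncated before those details. The function names `retrieve` and `speculative_step`, the word-overlap scoring, and the dictionary-based distributions are all hypothetical illustrations, not the authors' code.

```python
import random


def retrieve(chunks, query, k):
    """Hypothetical retriever: keep the k chunks sharing the most words with the query.

    Stands in for the step that gives the drafter a shortened context.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def speculative_step(draft_probs, target_probs, rng):
    """One token of standard speculative sampling (an assumption, not RAPID's exact rule).

    draft_probs:  token -> probability from the drafter (short retrieved context).
    target_probs: token -> probability from the target (full long context).
    Returns (token, accepted), where accepted says whether the draft token was kept.
    """
    tokens = list(draft_probs)
    drafted = rng.choices(tokens, weights=[draft_probs[t] for t in tokens])[0]

    q = draft_probs[drafted]
    p = target_probs.get(drafted, 0.0)
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return drafted, True

    # On rejection, resample from the normalized residual max(0, p - q),
    # which keeps the overall output distributed according to the target.
    residual = {
        t: max(0.0, target_probs[t] - draft_probs.get(t, 0.0)) for t in target_probs
    }
    candidates = list(residual)
    corrected = rng.choices(candidates, weights=[residual[t] for t in candidates])[0]
    return corrected, False


if __name__ == "__main__":
    rng = random.Random(0)
    short_context = retrieve(
        ["the cat sat", "dogs bark loudly", "a cat and a dog"], "dogs bark", k=1
    )
    print("drafter context:", short_context)
    draft = {"a": 0.7, "b": 0.2, "c": 0.1}   # toy drafter distribution
    target = {"a": 0.5, "b": 0.3, "c": 0.2}  # toy target distribution
    print(speculative_step(draft, target, rng))
```

In this sketch the drafter only ever sees the retrieved chunks, so its per-token cost does not grow with the full document length, while the accept/reject step keeps the sampled tokens distributed according to the target model.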
Taxonomy
Methods: Weight Decay · Attention Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Attention Is All You Need · Linear Warmup With Linear Decay
