REST: Retrieval-Based Speculative Decoding
Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, and Di He

TL;DR
REST is a retrieval-based decoding algorithm that accelerates language model generation by retrieving draft tokens from existing text instead of running a draft model, achieving a 1.62X to 2.36X speedup without extra training.
Contribution
REST introduces a retrieval-based speculative decoding method that seamlessly accelerates language models without additional training or modifications.
Findings
Achieves 1.62X to 2.36X speedup on 7B and 13B models
Works with code and text generation tasks
Plug-and-play integration with existing models
Abstract
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration with, and acceleration of, any language model, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at…
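The idea in the abstract (retrieve draft tokens that followed a similar context, then let the language model verify them) can be illustrated with a small sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the dictionary datastore, the function names `build_datastore`, `retrieve_draft`, and `rest_like_generate`, the `max_context` and `draft_len` parameters, and the toy `next_token` callable standing in for a greedy language model. A real system would verify the whole draft in one batched forward pass; the sketch checks tokens one at a time and counts each draft-and-verify round as one step.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Datastore = Dict[Tuple[int, ...], List[Tuple[int, ...]]]


def build_datastore(corpus: Sequence[Sequence[int]],
                    max_context: int = 4,
                    draft_len: int = 3) -> Datastore:
    """Map every context suffix (1..max_context tokens) to the tokens that followed it."""
    store: Datastore = {}
    for seq in corpus:
        for i in range(1, len(seq)):
            continuation = tuple(seq[i:i + draft_len])
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break
                store.setdefault(tuple(seq[i - n:i]), []).append(continuation)
    return store


def retrieve_draft(store: Datastore, context: Sequence[int],
                   max_context: int = 4) -> List[int]:
    """Return the most frequent continuation of the longest matching suffix, or []."""
    for n in range(min(max_context, len(context)), 0, -1):
        matches = store.get(tuple(context[-n:]))
        if matches:
            return list(Counter(matches).most_common(1)[0][0])
    return []


def rest_like_generate(next_token: Callable[[List[int]], int],
                       store: Datastore, prompt: Sequence[int],
                       max_new_tokens: int,
                       max_context: int = 4) -> Tuple[List[int], int]:
    """Greedy decoding with retrieved drafts; returns (tokens, draft-and-verify steps)."""
    out = list(prompt)
    target_len = len(out) + max_new_tokens
    steps = 0
    while len(out) < target_len:
        steps += 1
        accepted: List[int] = []
        for tok in retrieve_draft(store, out, max_context):
            expected = next_token(out + accepted)
            if tok != expected:
                accepted.append(expected)  # reject the draft, keep the model's token
                break
            accepted.append(tok)  # draft token matches what the model would emit
        else:
            # Whole draft accepted (or no draft found): the model adds one more token.
            accepted.append(next_token(out + accepted))
        out.extend(accepted)
    return out[:target_len], steps


# Toy stand-in for a language model: always predicts (last token + 1) mod 10.
def toy_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


store = build_datastore([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]])
tokens, steps = rest_like_generate(toy_model, store, prompt=[3], max_new_tokens=6)
print(tokens, steps)
```

Because every accepted token is one the model itself would have produced, the output matches plain greedy decoding; the saving comes from accepting several tokens per verification step when the retrieved draft is right, and a wrong draft costs nothing in output quality.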
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
