REST: Retrieval-Based Speculative Decoding

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, and Di He

arXiv:2311.08252 · cs.CL · April 5, 2024 · 1 citation

TL;DR

REST is a retrieval-based speculative decoding algorithm that accelerates language model generation by retrieving draft tokens relevant to the current context, achieving a 1.62X to 2.36X speedup without extra training.

Contribution

REST introduces a speculative decoding method that obtains draft tokens through retrieval rather than from a draft language model, so it can be plugged into existing language models without additional training.

Findings

01. Achieves 1.62X to 2.36X speedup on 7B and 13B models

02. Works with code and text generation tasks

03. Plug-and-play integration with existing models

Abstract

We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language models, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at…
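The retrieve-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the datastore layout or the verification procedure, so the longest-suffix lookup, the greedy acceptance rule, and every name below (`retrieve_draft`, `verify`, `generate`, `toy_next_token`, `TARGET`, `DATASTORE`) are hypothetical choices made for illustration. The "model" is a toy function that replays a fixed token sequence, and verification calls it once per token, whereas a real system would presumably score all draft tokens in one forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = str
NextTokenFn = Callable[[List[Token]], Token]


def retrieve_draft(context: Sequence[Token],
                   datastore: Sequence[Sequence[Token]],
                   max_suffix: int = 4,
                   draft_len: int = 3) -> List[Token]:
    """Assumed retrieval rule: find the longest suffix of the context that
    occurs in the datastore and return the tokens that followed it there."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = list(context[-n:])
        for doc in datastore:
            # Stop early enough that at least one continuation token exists.
            for i in range(len(doc) - n):
                if list(doc[i:i + n]) == suffix:
                    return list(doc[i + n:i + n + draft_len])
    return []  # nothing retrieved: fall back to ordinary decoding


def verify(context: Sequence[Token], draft: Sequence[Token],
           next_token: NextTokenFn) -> List[Token]:
    """Accept draft tokens while they match the model's greedy choice.
    Always returns at least one token produced by the model itself."""
    accepted: List[Token] = []
    for tok in draft:
        expected = next_token(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # model's correction ends the step
            return accepted
        accepted.append(tok)
    # Every draft token matched: the model adds one more token.
    accepted.append(next_token(list(context) + accepted))
    return accepted


def generate(prompt: Sequence[Token], datastore: Sequence[Sequence[Token]],
             next_token: NextTokenFn,
             max_new_tokens: int) -> Tuple[List[Token], int]:
    """Decode with retrieved drafts; returns the tokens and the step count."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = retrieve_draft(out, datastore)
        out.extend(verify(out, draft, next_token))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


# Toy stand-in for a language model (assumption): replays a fixed sequence.
TARGET = "def add ( a , b ) : return a + b".split()
# Toy datastore (assumption): one similar, but not identical, snippet.
DATASTORE = ["def add ( a , b ) : return a - b".split()]


def toy_next_token(context: List[Token]) -> Token:
    return TARGET[len(context)] if len(context) < len(TARGET) else "<eos>"


if __name__ == "__main__":
    tokens, steps = generate(["def", "add"], DATASTORE, toy_next_token, 10)
    print(" ".join(tokens), "| verification steps:", steps)
```

In this toy run the output is identical to what token-by-token greedy decoding would produce, but the 10 new tokens are emitted in 4 verification steps because most retrieved drafts are accepted. The step count here says nothing about real-world speedups, which depend on the model, the datastore, and the hardware.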

Code & Models

Repositories

fasterdecoding/rest (PyTorch, official)

Videos

REST: Retrieval-Based Speculative Decoding · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings