Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

Chen Zhang; Zhuorui Liu; Dawei Song

arXiv:2404.14897·cs.CL·April 24, 2024

TL;DR

This survey reviews the emerging field of speculative execution techniques in large language models, highlighting their potential to significantly improve decoding speed by parallelizing inference, and discusses future research directions.

Contribution

First comprehensive survey unifying and analyzing speculative execution methods in LLMs within a systematic framework and taxonomy.

Findings

01. Speculative execution can greatly accelerate LLM decoding.

02. Current methods include blockwise and speculative decoding techniques.

03. Challenges include balancing accuracy and speed in speculative approaches.

Abstract

As (causal) large language models (LLMs) grow ever larger, inference efficiency has become a core concern alongside improved performance. Compared with the memory footprint, latency appears to be the more pressing bottleneck, since an LLM (e.g., GPT-4) may receive billions of requests per day. The bottleneck stems mainly from the autoregressive nature of LLMs: tokens can only be generated sequentially during decoding. To alleviate it, the idea of speculative execution, which originates in computer architecture, has been introduced to LLM decoding in a draft-then-verify style. Under this regime, a sequence of tokens is drafted quickly using some heuristic, and the drafted tokens are then verified in parallel by the LLM. As the costly sequential inference is parallelized, LLM decoding speed can be…
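The draft-then-verify loop described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the survey: `target_next` and `draft_next` are hypothetical toy stand-ins for the LLM and the cheap drafter, decoding is greedy, and the verification that a real system would perform in one parallel forward pass is simulated here with an ordinary loop.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for the slow LLM: a deterministic toy rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical cheap drafter: wrong whenever len(prefix) is a multiple of 4."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(next_token: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify; returns the tokens and the number of verify passes."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # Draft: the cheap model proposes k tokens sequentially.
        drafted: List[Token] = []
        for _ in range(k):
            drafted.append(draft(out + drafted))
        # Verify: the target scores all k + 1 positions. A real LLM would do
        # this in a single parallel pass; the list comprehension simulates it.
        verify_passes += 1
        checks = [target(out + drafted[:i]) for i in range(k + 1)]
        # Accept the longest drafted prefix that matches the target's choices.
        n_ok = 0
        while n_ok < k and drafted[n_ok] == checks[n_ok]:
            n_ok += 1
        out.extend(drafted[:n_ok])
        # Append the target's own token: a correction at the first mismatch,
        # or a bonus token when every drafted token was accepted.
        out.append(checks[n_ok])
    return out[:goal], verify_passes


prompt = [1, 2, 3]
spec_tokens, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
base_tokens = greedy_decode(target_next, prompt, n_new=20)
print(spec_tokens == base_tokens, passes)
```

Under greedy decoding this variant reproduces the target model's output exactly, so the only thing that changes is how many target passes are needed; the saving depends on how often the drafter agrees with the target.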

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings