Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
Chen Zhang, Zhuorui Liu, Dawei Song

TL;DR
This survey reviews the emerging field of speculative execution techniques in large language models, highlighting their potential to significantly improve decoding speed by parallelizing inference, and discusses future research directions.
Contribution
First comprehensive survey unifying and analyzing speculative execution methods in LLMs within a systematic framework and taxonomy.
Findings
Speculative execution can greatly accelerate LLM decoding.
Current methods include blockwise and speculative decoding techniques.
Challenges include balancing accuracy and speed in speculative approaches.
Abstract
As (causal) large language models (LLMs) grow ever larger, inference efficiency has become a core concern alongside improved performance. Compared with the memory footprint, the latency bottleneck seems to be of greater importance, as there can be billions of requests to an LLM (e.g., GPT-4) per day. The bottleneck is mainly due to the autoregressive nature of LLMs, where tokens can only be generated sequentially during decoding. To alleviate the bottleneck, the idea of speculative execution, which originates from the field of computer architecture, is introduced to LLM decoding in a draft-then-verify style. Under this regime, a sequence of tokens is drafted quickly using some heuristics, and the tokens are then verified in parallel by the LLM. As the costly sequential inference is parallelized, LLM decoding speed can be…
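The draft-then-verify loop described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the survey's method: `target_next` and `draft_next` are hypothetical toy stand-ins for an expensive LLM and a cheap drafter, and verification uses simple greedy matching (accept drafted tokens until the first disagreement with the target). In a real system the verification step would be a single batched forward pass; here it is simulated with a list comprehension.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical 'expensive' model: a deterministic toy rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical 'cheap' drafter: agrees with the target except
    when the prefix length is a multiple of 4."""
    tok = target_next(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % 50


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per new token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Draft k tokens cheaply, then verify them against the target.

    Returns the decoded sequence and the number of verification passes.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # Draft: the cheap model proposes k tokens sequentially.
        drafted: List[Token] = []
        for _ in range(k):
            drafted.append(draft(out + drafted))

        # Verify: target predictions at every drafted position
        # (conceptually one parallel pass over all k + 1 positions).
        passes += 1
        checks = [target(out + drafted[:i]) for i in range(k + 1)]

        # Accept the longest prefix of the draft that the target agrees with.
        accepted = 0
        while accepted < k and drafted[accepted] == checks[accepted]:
            accepted += 1
        out.extend(drafted[:accepted])
        # The target's own token at the first mismatch (or one bonus token
        # if every drafted token was accepted) is always kept.
        out.append(checks[accepted])
    return out[:goal], passes
```

Under greedy matching, the output is identical to what the target alone would produce, while the number of target passes is at most, and typically fewer than, the number of generated tokens. Sampling-based verification schemes are more involved and are not shown here.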
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
