Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models
Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, and Roy Schwartz

TL;DR
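This paper introduces DISCO, a method that dynamically selects the speculation lookahead (the number of tokens the draft model generates per iteration) during speculative decoding of large language models. It speeds up inference by 10% on average over the best static lookahead while generating exactly the same text.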
Contribution
DISCO is a novel method that selects the speculation lookahead dynamically instead of fixing it for all iterations, and it outperforms the best static-lookahead baseline in speculative decoding of large language models.
Findings
Achieves an average speedup of 10% over the best static lookahead baseline
Generates exactly the same text as static lookahead decoding
Demonstrates effectiveness across four datasets
Abstract
Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work, we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.
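To make the role of the SL concrete, here is a minimal toy sketch in Python of greedy speculative decoding with either a fixed SL or an early-stopping rule. It is not the paper's implementation: `target_model`, `draft_model`, and the `stop_threshold` confidence rule are hypothetical stand-ins chosen only for illustration, and the abstract does not specify how DISCO actually decides when to stop drafting.

```python
from typing import List, Optional, Tuple

Token = int
VOCAB = 11  # size of the toy vocabulary (assumption for this sketch)


def target_model(prefix: List[Token]) -> Tuple[Token, float]:
    """Hypothetical stand-in for the large model: a deterministic rule."""
    return (len(prefix) * 7 + sum(prefix)) % VOCAB, 1.0


def draft_model(prefix: List[Token]) -> Tuple[Token, float]:
    """Hypothetical stand-in for the small model.

    It agrees with the target except at every 4th position, where it
    emits a wrong token and reports low confidence.
    """
    token, _ = target_model(prefix)
    if len(prefix) % 4 == 3:
        return (token + 1) % VOCAB, 0.2
    return token, 0.9


def speculative_decode(
    n_tokens: int,
    max_sl: int,
    stop_threshold: Optional[float] = None,
) -> Tuple[List[Token], int, int]:
    """Greedy speculative decoding on the toy models.

    With stop_threshold=None the draft always produces max_sl tokens
    (static SL). Otherwise drafting stops early once the draft's
    confidence falls below the threshold (an illustrative dynamic rule,
    not the rule used by DISCO).
    Returns (text, number of target iterations, number of drafted tokens).
    """
    out: List[Token] = []
    target_calls = 0
    drafted = 0
    while len(out) < n_tokens:
        # Draft phase: propose up to max_sl tokens.
        draft: List[Token] = []
        while len(draft) < max_sl:
            token, confidence = draft_model(out + draft)
            if stop_threshold is not None and confidence < stop_threshold:
                break
            draft.append(token)
        drafted += len(draft)

        # Verification phase: counted as one target iteration. A real
        # system scores all draft positions in a single forward pass;
        # the toy simply re-evaluates the rule at each position.
        target_calls += 1
        accepted: List[Token] = []
        for token in draft:
            expected, _ = target_model(out + accepted)
            if token != expected:
                break
            accepted.append(token)

        # The target always contributes one token of its own, so every
        # emitted token equals what plain greedy decoding would produce.
        bonus, _ = target_model(out + accepted)
        out.extend(accepted + [bonus])
    return out[:n_tokens], target_calls, drafted


static_text, static_calls, static_drafted = speculative_decode(20, max_sl=5)
dynamic_text, dynamic_calls, dynamic_drafted = speculative_decode(
    20, max_sl=5, stop_threshold=0.5
)
print(static_text == dynamic_text, static_drafted, dynamic_drafted)
```

In this toy, both settings emit the same text because every accepted token is checked against the target, which mirrors the "exact same text" property stated in the abstract. The two runs differ only in how many draft tokens are generated and then discarded, which is the cost that choosing the SL per iteration aims to reduce.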
Taxonomy
Topics: Error Correcting Code Techniques · Advanced Data Compression Techniques · Chaos-based Image/Signal Encryption
