Cascade Speculative Drafting for Even Faster LLM Inference
Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chen-Chuan Chang, Jie Huang

TL;DR
This paper proposes Cascade Speculative Drafting (CS Drafting), a speculative decoding method for large language models that speeds up inference by eliminating autoregressive generation from neural draft models and by optimizing how drafting time is allocated across tokens.
Contribution
It introduces a cascade-based speculative decoding algorithm that removes autoregressive generation from neural draft models and optimizes how drafting time is spent across tokens, achieving faster inference without changing the output distribution.
Findings
Achieves greater speedup than baseline methods
Preserves the same output distribution as the target model
Effectively reduces inference time in large language models
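The claim that the output distribution is preserved can be checked on a toy vocabulary. The sketch below uses the standard speculative sampling accept/reject rule that speculative decoding methods build on; it is not the cascade algorithm itself, and the names `speculative_step` and `empirical_distribution` are hypothetical helpers introduced only for illustration.

```python
import random


def speculative_step(p, q, rng):
    """Draft one token from q, then accept or correct it against p.

    p: target-model probabilities over a toy vocabulary (list of floats).
    q: draft-model probabilities over the same vocabulary (assumed > 0).
    Returns (token_index, accepted_flag).
    """
    vocab = range(len(p))
    # Draft: the small model proposes a token.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # Verify: accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Reject: resample from the residual max(0, p - q); choices() normalizes it.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0], False


def empirical_distribution(p, q, n, rng):
    """Run n independent steps and return the observed token frequencies."""
    counts = [0] * len(p)
    for _ in range(n):
        token, _accepted = speculative_step(p, q, rng)
        counts[token] += 1
    return [c / n for c in counts]
```

With, say, `p = [0.6, 0.3, 0.1]` and `q = [0.2, 0.5, 0.3]`, the frequencies returned by `empirical_distribution` should approach `p` as `n` grows, even though every proposal comes from `q`.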
Abstract
Introduced to enhance the efficiency of large language model (LLM) inference, speculative decoding operates by having a smaller model generate a draft. A larger target model then reviews this draft to align with its output, and any acceptance by the target model results in a reduction of the number of the target model runs, ultimately improving efficiency. However, the drafting process in speculative decoding includes slow autoregressive generation and allocates equal time to generating tokens, irrespective of their importance. These inefficiencies collectively contribute to the suboptimal performance of speculative decoding. To further improve LLM inference, we introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal…
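To make the link between acceptance and fewer target-model runs concrete, here is a minimal sketch under a simplifying assumption that is not stated in the abstract: each of `k` drafted tokens is accepted independently with the same probability `alpha`. The function name and parameters are hypothetical. Each target run then yields the accepted prefix plus one token from the target itself, so the expected count is the geometric sum 1 + alpha + ... + alpha^k.

```python
def expected_tokens_per_target_run(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model run.

    alpha: assumed per-token acceptance probability (i.i.d. simplification).
    k: number of tokens drafted before each target-model verification.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(k + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

Under this assumption, a higher acceptance rate means more tokens per target run and therefore fewer target runs for the same output length; plain autoregressive decoding corresponds to one token per run.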
Peer Reviews
Decision: NeurIPS 2024 poster
**Originality** This paper demonstrates originality by expanding on speculative decoding. It introduces two novel techniques - horizontal and vertical cascades - that effectively factorize the draft-and-verification steps of speculative decoding across multiple models. **Quality and Clarity** The authors base their approach on intuitive assumptions, like the complexity of first token generation. The speed improvements over vanilla speculative decoding illustrate the effectiveness of the propose
**Algorithmic Complexity**: The paper discusses the use of horizontal and vertical cascades, but each additional cascade increases the algorithmic complexity of the decoding process. This complexity can become particularly challenging with larger models. The paper does not address how to balance this increased complexity with the potential speedup gains. Important questions such as the optimal number of cascades and the comparative usefulness of horizontal versus vertical cascades remain unanswe
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
Methods: ALIGN
