Cascade Speculative Drafting for Even Faster LLM Inference

Ziyi Chen; Xiaocong Yang; Jiacheng Lin; Chenkai Sun; Kevin Chen-Chuan Chang; Jie Huang

arXiv:2312.11462 · cs.LG · July 15, 2025 · 2 cites

TL;DR

This paper proposes Cascade Speculative Drafting (CS Drafting), a speculative decoding method for large language models that speeds up inference by removing autoregressive generation from the neural draft models and by optimizing how drafting time is allocated across tokens.

Contribution

It introduces a cascade-based speculative decoding algorithm that removes autoregressive generation from the neural draft models and optimizes how drafting time is allocated across tokens, achieving faster inference without changing the target model's output distribution.

Findings

01 Achieves greater speedup than baseline methods

02 Preserves the same output distribution as the target model

03 Effectively reduces inference time in large language models

Abstract

Introduced to enhance the efficiency of large language model (LLM) inference, speculative decoding operates by having a smaller model generate a draft. A larger target model then reviews this draft to align with its output, and any acceptance by the target model results in a reduction of the number of the target model runs, ultimately improving efficiency. However, the drafting process in speculative decoding includes slow autoregressive generation and allocates equal time to generating tokens, irrespective of their importance. These inefficiencies collectively contribute to the suboptimal performance of speculative decoding. To further improve LLM inference, we introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal…
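As a rough illustration of the draft-then-verify loop the abstract describes, the sketch below runs one round of standard speculative sampling on toy categorical distributions. It does not implement the Vertical or Horizontal Cascade; it only shows the baseline mechanism they build on. The names `speculative_step`, `draft_model`, and `target_model`, as well as the three-token vocabulary, are hypothetical and chosen for illustration.

```python
import random


def sample(probs, rng):
    """Draw an index from a categorical distribution given as a list."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point rounding


def speculative_step(prefix, draft_model, target_model, k, rng):
    """One draft-then-verify round; returns the tokens to append to prefix.

    draft_model and target_model are hypothetical callables that map a token
    list to a next-token probability list (assumed strictly positive here).
    """
    # 1. The small model drafts k tokens autoregressively.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_model(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every drafted position plus one extra.
    #    A real system does this in a single parallel forward pass.
    p_dists = [target_model(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept each drafted token with probability min(1, p/q). On the
    #    first rejection, resample from the normalized residual max(0, p - q).
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            out.append(sample([r / total for r in residual], rng))
            return out

    # 4. All k drafts accepted: take one bonus token from the target model.
    out.append(sample(p_dists[k], rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    draft = lambda ctx: [0.5, 0.3, 0.2]   # toy draft distribution
    target = lambda ctx: [0.2, 0.3, 0.5]  # toy target distribution
    print(speculative_step([0], draft, target, k=3, rng=rng))
```

Each round yields between 1 and k + 1 tokens for a single target-model pass, which is where the reduction in target model runs comes from. The accept-or-resample rule in step 3 is what keeps the emitted tokens distributed according to the target model.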

Peer Reviews

Decision · NeurIPS 2024 poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

**Originality** This paper demonstrates originality by expanding on speculative decoding. It introduces two novel techniques, horizontal and vertical cascades, that effectively factorize the draft-and-verification steps of speculative decoding across multiple models. **Quality and Clarity** The authors ground their approach in intuitive assumptions, such as the complexity of first token generation. The speed improvements over vanilla speculative decoding illustrate the effectiveness of the propose

Weaknesses

**Algorithmic Complexity**: The paper discusses the use of horizontal and vertical cascades, but each additional cascade increases the algorithmic complexity of the decoding process. This complexity can become particularly challenging with larger models. The paper does not address how to balance this increased complexity with the potential speedup gains. Important questions such as the optimal number of cascades and the comparative usefulness of horizontal versus vertical cascades remain unanswe

Code & Models

Repositories

lfsszd/cs-drafting · PyTorch · Official

Videos

Cascade Speculative Drafting for Even Faster LLM Inference · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms

Methods: ALIGN