DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, and Lei Zou

TL;DR
DySpec introduces a dynamic token tree structure for speculative decoding that adaptively expands during inference, significantly increasing speed and efficiency for large language models across various conditions.
Contribution
The paper proposes DySpec, a dynamic token tree method for speculative decoding that expands the token tree at run time, using the draft distribution as a proxy for the acceptance rate, and outperforms fixed-tree approaches.
Findings
DySpec achieves up to 9.1× throughput improvement on Llama2-70B.
DySpec reduces latency by up to 9.4× compared to existing methods.
DySpec outperforms SpecInfer and Sequoia across various data distributions and model sizes.
Abstract
While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods usually organize predicted tokens as independent chains or fixed token trees, which fails to generalize to diverse query distributions. In this paper, we propose DySpec, a faster speculative decoding algorithm with a novel dynamic token tree structure. We begin by bridging the draft distribution and acceptance rate from intuitive and empirical clues, and successfully show that the two variables are strongly correlated. Based on this, we employ a greedy strategy to dynamically expand the token tree at run time. Theoretically, we show that our method can achieve optimal results under mild assumptions. Empirically, DySpec yields a higher acceptance rate and…
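The greedy expansion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the draft probability of a token can stand in for its acceptance rate (the correlation the abstract describes), scores each candidate node by the product of draft probabilities along its path, and repeatedly adds the highest-scoring candidate until a node budget is reached. The names `build_tree`, `draft_probs`, `toy_draft`, and `budget` are hypothetical and chosen only for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def build_tree(
    draft_probs: Callable[[Prefix], Dict[str, float]],
    budget: int,
) -> List[Tuple[Prefix, float]]:
    """Greedily select up to `budget` tree nodes by estimated acceptance.

    `draft_probs(prefix)` is a hypothetical stand-in for a draft model: it
    returns a mapping from candidate next tokens to draft probabilities.
    """
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-p, (tok,)) for tok, p in draft_probs(()).items()]
    heapq.heapify(heap)

    tree: List[Tuple[Prefix, float]] = []
    while heap and len(tree) < budget:
        neg_score, prefix = heapq.heappop(heap)
        score = -neg_score
        tree.append((prefix, score))
        # Children become candidates only after their parent is selected,
        # so every selected node is connected to the root.
        for tok, p in draft_probs(prefix).items():
            heapq.heappush(heap, (-(score * p), prefix + (tok,)))
    return tree


def toy_draft(prefix: Prefix) -> Dict[str, float]:
    """Toy draft model (assumption): fixed distribution, maximum depth 3."""
    if len(prefix) >= 3:
        return {}
    return {"a": 0.75, "b": 0.25}


if __name__ == "__main__":
    for path, score in build_tree(toy_draft, budget=4):
        print(" ".join(path), score)
```

With the toy draft model, the budget is spent mostly on the high-probability chain before a lower-probability sibling is added, which is the behavior a fixed tree shape cannot adapt to.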
Peer Reviews
Decision: Submitted to ICLR 2025
1. The correlation of Hypothesis 1 with the proposed method is well elaborated and explained.
2. The structure of the paper is clear and easy to follow.
1. Lack of related work. Context-aware dynamic draft token tree is not a new idea. I would like to draw your attention to a very related work: “EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees” (EMNLP'24). This paper also proposes adopting dynamic draft token trees and claims a strong positive correlation between the draft model confidence score and the acceptance rate of the token. Another relevant paper is “Dynamic Depth Decoding: Faster Speculative Decoding for LLMs,” whi
1. Strong Theoretical Foundation
   - Provides rigorous theoretical analysis linking draft distribution to acceptance rate
   - Includes formal proofs of optimality under stated assumptions
   - Clearly bridges theoretical insights with practical implementation
2. Novel Technical Contributions
   - Introduces an innovative dynamic token tree construction approach
   - Develops efficient algorithms for both fixed-size and threshold-based tree construction
   - Proposes block-sparsity friendly token ordering for o
1. Limited Discussion of Limitations
   - Could elaborate more on scenarios where the method might not perform optimally
   - More discussion of the trade-offs between fixed-size and threshold-based approaches would be valuable
2. Implementation Details
   - Some implementation specifics about the C++ optimizations could be expanded
   - Could provide more guidance on threshold selection for different scenarios
3. Experimental Validation
   - Could include more ablation studies to isolate the impact of diffe
- The proposed idea is quite useful and can provide significant speedups for tree-based speculative decoding methods.
- The reported overheads are negligible.
- See the summary section please.
Taxonomy
Topics: Coding theory and cryptography · Cryptographic Implementations and Security · Interconnection Networks and Systems
