Faster Cascades via Speculative Decoding
Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar

TL;DR
This paper introduces a novel speculative cascading method that combines cascades and speculative decoding, achieving superior inference efficiency and cost-quality trade-offs in language models.
Contribution
It designs new speculative cascading techniques, characterizes the optimal deferral rule, and demonstrates improved performance over existing methods.
Findings
Better cost-quality trade-offs than baseline methods
Outperforms traditional cascades and speculative decoding in experiments
Achieves improved inference efficiency with language models
Abstract
Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades offer better cost-quality trade-offs, often even outperforming the large model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative…
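To make the two mechanisms contrasted in the abstract concrete, here is a minimal Python sketch. It is not the paper's speculative cascading rule. `cascade_defer` shows a simple confidence-threshold deferral rule of the kind cascades commonly use, and `speculative_accept` shows the standard speculative sampling acceptance step (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q)). All function names, the threshold, and the toy distributions are assumptions made for illustration.

```python
import random


def cascade_defer(small_probs, threshold):
    """Hypothetical cascade deferral rule: defer to the large model when the
    small model's top probability (its confidence) is below `threshold`."""
    return max(small_probs) < threshold


def residual_distribution(p_large, q_small):
    """Normalized max(0, p - q): the distribution to resample from after a
    drafted token is rejected in standard speculative sampling."""
    diff = [max(0.0, p - q) for p, q in zip(p_large, q_small)]
    total = sum(diff)
    if total == 0.0:
        # p and q coincide, so there is no residual mass; fall back to p.
        return list(p_large)
    return [d / total for d in diff]


def speculative_accept(token, p_large, q_small, rng):
    """One verification step of standard speculative sampling.

    `token` is assumed to have been drafted from `q_small`, so
    q_small[token] > 0. Returns (final_token, was_accepted).
    """
    accept_prob = min(1.0, p_large[token] / q_small[token])
    if rng.random() < accept_prob:
        return token, True
    residual = residual_distribution(p_large, q_small)
    new_token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return new_token, False


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.8, 0.2]  # toy small-model (drafter) distribution
    p = [0.5, 0.5]  # toy large-model (verifier) distribution
    print(cascade_defer([0.6, 0.3, 0.1], threshold=0.7))  # True: low confidence
    print(speculative_accept(0, p, q, rng))
```

In this toy setting, drafted token 0 is accepted with probability 0.5 / 0.8 = 0.625; on rejection the residual places all of its mass on token 1. The cascade rule consults only the small model's confidence, whereas the speculative step compares both models' probabilities for the drafted token.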
Peer Reviews
Decision: ICLR 2025 Oral
- The proposed method is lightweight and does not require supervised fine-tuning. - The paper flows naturally and is easy to read and understand overall.
- The reasoning behind the designs of the loss functions in Equations (3) and (8) is unclear. - The experimental design seems unfair and the improvements are limited. SpecCascade [Token] uses top tokens, while the other methods are evaluated with vanilla sampling at a temperature of 1. Sampling methods like Top-K and Top-P can lead to better performance by avoiding out-of-distribution tokens. To ensure a fair comparison, baseline methods should also use Top-K or Top-P sampling. Given the experi
1. The paper addresses a promising and challenging research direction by combining model cascades with speculative decoding. Recent studies have shown interest in integrating speculative decoding with advanced techniques, such as contrastive decoding [1], to accelerate inference while enhancing the generation quality of LLMs. Speculative cascading complements these efforts by exploring model cascades in speculative decoding. Through both empirical and theoretical analyses, this work innovativel
1. **Fairness of Comparison**: In Table 2, the authors report minimal latency when matching the quality of a large model, as well as the best quality metric achievable without exceeding the latency of LLMs for each method. However, it is unclear if these comparisons are entirely fair. For instance, it would be helpful to know if the results for BiLD were reported under similar configurations, ensuring a consistent basis for comparison. 2. **Applicability of Speculative Cascading**: Figures 2 and
- An interesting and intuitive approach: in cascading, the small model can sometimes outperform the larger one. In contrast, in speculative decoding, the model is guaranteed to match the large model quality, but is typically faster. By combining them, the authors allow for fast decoding, with potential improvement, which leads to overall higher speedup. - Some of the claims are clever. I particularly like the intuition that only considering the confidence of the small model is sub-optimal, and w
- I had some trouble following section 4.3. A roadmap/intuition would have been helpful. In particular, I did not fully understand the role of Lemma 3, and the overall takeaway from this section. - The experiments section was also a bit hard to follow. It starts with outlining the different deferral rules, and then presents the baselines. It seems both parts are somewhat overlapping. It would be helpful to merge them and discuss the link between the two, and particularly not have a paragraph sep
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · SentencePiece · Gated Linear Unit · Attention Dropout · Linear Layer · Residual Connection · Multi-Head Attention
