TL;DR
This paper presents CAST, a dynamic tree decoding method for speculative decoding in large language models that accounts for system variables such as GPU device and batch size, achieving speedups of up to 5.2x and outperforming existing methods by 5% to 20% across six tasks and six LLMs.
Contribution
CAST introduces a cost-aware dynamic tree decoding approach that adapts to system variables like GPU and batch size, improving inference efficiency in large language models.
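To make the cost-aware idea concrete, here is a minimal sketch of choosing a draft-tree size from a precomputed latency table. It is not the paper's algorithm. The function `pick_tree_size`, the latency tables, the acceptance probabilities, and the linear per-token drafting cost are all hypothetical. The only general fact it relies on is that the expected number of accepted draft tokens equals the sum of per-node acceptance probabilities.

```python
from typing import Dict, Sequence


def pick_tree_size(
    verify_cost_ms: Dict[int, float],
    draft_cost_ms_per_token: float,
    accept_probs: Sequence[float],
) -> int:
    """Return the candidate tree size with the best expected tokens per ms.

    verify_cost_ms maps a candidate tree size to a profiled verification
    latency for one fixed GPU and batch size (hypothetical lookup table).
    accept_probs[i] is the estimated chance that the i-th most likely draft
    node is accepted, sorted in descending order (assumed to be given).
    """
    best_size, best_rate = 0, 0.0
    for size, verify_ms in sorted(verify_cost_ms.items()):
        # Expected accepted draft tokens, plus the one token the target
        # model contributes in every verification step.
        expected_tokens = 1.0 + sum(accept_probs[:size])
        # Assumption: drafting cost grows linearly with the tree size.
        step_ms = verify_ms + draft_cost_ms_per_token * size
        rate = expected_tokens / step_ms
        if rate > best_rate:
            best_size, best_rate = size, rate
    return best_size


# Hypothetical acceptance estimates for the eight most likely draft nodes.
probs = [0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.02, 0.01]

# Hypothetical profiles: verification latency is nearly flat in tree size at
# a small batch size, but grows steeply once the GPU is saturated.
small_batch_cost = {2: 10.0, 4: 10.1, 8: 10.2}
large_batch_cost = {2: 10.0, 4: 14.0, 8: 24.0}

small_batch_choice = pick_tree_size(small_batch_cost, 0.05, probs)
large_batch_choice = pick_tree_size(large_batch_cost, 0.05, probs)
```

With these made-up numbers the small-batch profile favors the largest tree and the large-batch profile favors the smallest one, which is the kind of dependence on system variables that a cost-agnostic tree-sizing rule would miss.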
Findings
Achieves up to 5.2x faster inference speeds.
Outperforms existing methods by 5% to 20%.
Validated across six tasks and six LLMs.
Abstract
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional…
Peer Reviews
Decision: ICLR 2026 Poster
- The idea of using a utility function to account for resource constraints in the speculative decoding setting is novel and well motivated, and the paper is well organized.
- The way the paper formulates the utility function based on acceptance rate, and how it chooses the depth, is new.
- The experimental results are comprehensive and convincing. The authors validate their CAST method across 6 distinct LLMs and 6 diverse tasks, ranging from multi-turn conversation to code generation.
- The method introduces precomputation overhead. If the hardware, the batching strategy, or even the model (which changes the cost profile) is modified, the entire precomputation step must be redone.
- There are multiple thresholds to tune (three new cost-utility thresholds). What are the overheads?
- The paper presents batch size as a factor and a motivation but lacks comprehensive experiments on it.
- Theorem 4.1: Try to make it self-contained. What does $j$ mean in $c_j$?
1. The method replaces prior heuristics with a novel and principled cost-utility framework. This formal optimization of acceptance "utility" versus hardware "cost" is a more robust and generalizable approach.
2. Comprehensive and Rigorous Experimentation: Claims are exceptionally well-supported by extensive experiments across 6 models, 6 tasks, and 3 different GPU architectures. This thoroughness confirms the method's effectiveness and generality.
3. The authors correctly use "Speedup Ratio"
1. Unquantified Profiling Overhead: The method relies on pre-computing cost lookup tables, but the paper never quantifies the one-time profiling cost (e.g., in GPU-hours), which could be a significant practical barrier to adoption.
2. Lack of Hyperparameter Sensitivity Analysis: The new thresholds ($C_1, C_2, C_3$) are critical to the method, but their robustness and the strategy for tuning them are not discussed, leaving a key practical question unanswered.
3. Unclear Intuition for Generaliza
- It correctly identifies that SOTA dynamic tree methods ignore critical system costs like batch size and GPU type, which can negate speedups. The proposed cost-utility model, which uses precomputed lookup tables to guide tree construction, is a good solution.
- A key contribution is the generalization of prior SOTA (EAGLE-2/3), demonstrating they are special cases of this new framework (Theorem 4.1). The empirical results are strong, showing consistent 5-20% gains over EAGLE-3 and demonstra
- One weakness is the reliance on a new set of hyperparameters, specifically the cost thresholds $C_1$, $C_2$, and $C_3$ and the buffer size $R$, whose selection and sensitivity are not discussed or ablated.
- The method's practicality hinges on pre-computing cost-lookup tables $S_T(B)$ and $S_D(B)$. While practical, the paper does not sufficiently analyze the cost and complexity of this profiling step, which must be run for different hardware and batching configurations. It is unclear h
