SMART: When is it Actually Worth Expanding a Speculative Tree?
Lifu Wang, Pan Zhou

TL;DR
SMART is a runtime framework that decides when expanding a speculative decoding tree is worth its cost, maximizing end-to-end speedup and outperforming existing methods without retraining.
Contribution
It introduces a hardware-aware marginal analysis approach for dynamic tree expansion, improving decoding speed in large language models without additional training.
Findings
SMART achieves an average 20.0% speedup for MLLMs.
SMART delivers a 15.4% speedup for LLMs.
It outperforms state-of-the-art baselines across diverse models and hardware.
Abstract
Tree-based speculative decoding accelerates autoregressive generation by verifying a branching tree of draft tokens in a single target-model forward pass. However, existing methods prioritize maximizing token-level likelihood or the number of accepted tokens while ignoring a critical "efficiency paradox": the computational overhead of drafting and verifying large trees can grow super-linearly, particularly at scale. This often leads to negative wall-clock speedup when batch sizes increase or hardware saturation limits are reached. To address this, we propose SMART, a system-aware marginal analysis framework for runtime tree construction. SMART reformulates tree expansion as a hardware-aware optimization problem that directly maximizes end-to-end speedup. By applying a principled marginal benefit-cost rule at inference time, SMART expands a node only when its marginal benefit-cost…
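The abstract is cut off before it states the exact expansion rule, so the following is only an illustrative sketch of marginal benefit-cost reasoning in general, not the authors' criterion. It assumes a hypothetical latency model (`verify_cost`, flat until a saturation point, then linear) and a hypothetical greedy routine (`expand_tree`) that adds a candidate node only if doing so raises the estimated number of accepted tokens per millisecond. All names and numbers are invented for illustration.

```python
def verify_cost(n_tokens, base=10.0, free_tokens=8, per_token=0.5):
    """Hypothetical verification latency in ms.

    Assumption: cost is flat while the hardware has spare capacity
    (up to `free_tokens` draft tokens), then grows linearly.
    """
    extra = max(0, n_tokens - free_tokens)
    return base + per_token * extra


def expand_tree(path_probs, cost_fn=verify_cost):
    """Greedily add candidate nodes while throughput keeps improving.

    `path_probs` holds each candidate's estimated probability of being
    accepted (its cumulative path probability). Tree-shape constraints
    are ignored here for brevity.
    Returns (number of nodes kept, estimated tokens per ms).
    """
    n = 0
    expected = 1.0  # each verification step yields at least one token
    for p in sorted(path_probs, reverse=True):
        current_rate = expected / cost_fn(n)
        new_rate = (expected + p) / cost_fn(n + 1)
        if new_rate <= current_rate:
            break  # marginal benefit no longer justifies marginal cost
        n += 1
        expected += p
    return n, expected / cost_fn(n)


if __name__ == "__main__":
    candidates = [0.9, 0.8, 0.5, 0.3, 0.1, 0.05, 0.02, 0.01, 0.005, 0.001]
    kept, rate = expand_tree(candidates)
    print(f"kept {kept} nodes, about {rate:.3f} tokens/ms")
```

Under this toy cost model, low-probability nodes are still added while verification is effectively free, and are rejected once each extra token adds latency. That is the qualitative behavior the abstract describes when hardware saturation limits are reached.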
