SMART: When is it Actually Worth Expanding a Speculative Tree?

Lifu Wang; Pan Zhou

arXiv:2604.09731·cs.DC·April 14, 2026

TL;DR

SMART is a runtime optimization framework that expands a speculative decoding tree only when the expansion improves end-to-end speedup. It outperforms existing methods without retraining.

Contribution

The paper introduces a hardware-aware marginal analysis approach to dynamic tree expansion that improves decoding speed in large language models without additional training.

Findings

1. SMART achieves an average 20.0% speedup for MLLMs.
2. SMART delivers a 15.4% speedup for LLMs.
3. It outperforms state-of-the-art baselines across diverse models and hardware.

Abstract

Tree-based speculative decoding accelerates autoregressive generation by verifying a branching tree of draft tokens in a single target-model forward pass. However, existing methods prioritize maximizing token-level likelihood or the number of accepted tokens while ignoring a critical "efficiency paradox": the computational overhead of drafting and verifying large trees can grow super-linearly, particularly at scale. This often leads to negative wall-clock speedup when batch sizes increase or hardware saturation limits are reached. To address this, we propose SMART, a system-aware marginal analysis framework for runtime tree construction. SMART reformulates tree expansion as a hardware-aware optimization problem that directly maximizes end-to-end speedup. By applying a principled marginal benefit-cost rule at inference time, SMART expands a node only when its marginal benefit-cost…
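
The abstract is cut off before it states the exact expansion rule, so the following is only a toy sketch of what a marginal benefit-cost test could look like, not the authors' algorithm. It assumes that benefit is the expected number of accepted tokens (one guaranteed token plus the path-acceptance probability of each tree node), that cost comes from a made-up latency model that turns super-linear past a saturation point, and that candidate nodes arrive as a flat list of path-acceptance probabilities. The names `verify_cost` and `grow_tree` and all constants are hypothetical.

```python
import heapq


def verify_cost(n_nodes, base=10.0, per_node=0.05, saturation=32, penalty=0.02):
    """Hypothetical step latency (ms) for verifying a tree with n_nodes draft tokens.

    Roughly linear until `saturation` nodes, then quadratic in the excess,
    standing in for the super-linear overhead the abstract describes.
    """
    extra = max(0, n_nodes - saturation)
    return base + per_node * n_nodes + penalty * extra ** 2


def grow_tree(candidates, cost_fn=verify_cost):
    """Greedily add candidate nodes while tokens-per-millisecond keeps improving.

    candidates: path-acceptance probabilities of candidate draft nodes
                (assumed inputs; a real system would get them from the draft model).
    Returns (number of nodes kept, expected tokens per step, step cost).
    """
    heap = [-p for p in candidates]  # max-heap via negated values
    heapq.heapify(heap)

    n = 0
    expected = 1.0       # the target model always yields one token per step
    cost = cost_fn(0)

    while heap:
        p = -heapq.heappop(heap)
        new_cost = cost_fn(n + 1)
        marginal_cost = new_cost - cost
        # Adding the node raises expected/cost only if its marginal
        # benefit-cost ratio exceeds the current average ratio.
        if p / marginal_cost <= expected / cost:
            break  # later candidates have lower p, so they would fail too
        n += 1
        expected += p
        cost = new_cost

    return n, expected, cost
```

For example, `grow_tree([0.9, 0.8, 0.5, 0.1, 0.001])` keeps the first four candidates and rejects the last, because a 0.001 chance of one extra token does not justify even the small per-node cost. Lowering `saturation` makes the rule stop earlier, which mirrors the abstract's point that larger trees can reduce wall-clock speedup once hardware limits are reached.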
