SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding
Yujia Tong, Tian Zhang, Yunyang Wan, Kaiwei Lin, Jingling Yuan, Chuang Hu

TL;DR
SAGE introduces an entropy-guided speculative decoding framework that adapts the speculation tree structure in real time, speeding up inference in vision-language models without sacrificing output quality.
Contribution
SAGE proposes an adaptive speculation tree mechanism that uses output entropy to adjust tree depth and width during decoding, improving decoding efficiency in vision-language models.
Findings
Achieves up to 3.36x speedup on LLaVA-OneVision-72B
Achieves up to 3.18x speedup on Qwen2.5-VL-72B
Maintains output quality while accelerating inference
Abstract
Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE…
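The tree-shaping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalized-entropy thresholds (`low`, `high`), and the specific (depth, width) pairs are hypothetical values chosen for illustration. The sketch assumes that the entropy observed at the previous decoding step is used as the confidence signal for the next speculation tree, which is one plausible reading of the "temporal correlation" insight.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_tree_shape(prev_entropy, vocab_size, low=0.2, high=0.6):
    """Map the previous step's entropy to a (depth, width) pair.

    Thresholds and shapes are illustrative assumptions, not values
    taken from the paper.
    """
    # Normalize by the maximum possible entropy, log(vocab_size).
    norm = prev_entropy / math.log(vocab_size)
    if norm < low:
        return (8, 2)   # confident: deeper, narrower tree
    if norm < high:
        return (5, 4)   # intermediate confidence
    return (3, 8)       # uncertain: shallower, wider tree


# Toy 4-token vocabulary: one peaked and one flat distribution.
confident = softmax([10.0, 0.0, 0.0, 0.0])
uncertain = softmax([0.0, 0.0, 0.0, 0.0])

confident_shape = choose_tree_shape(entropy(confident), vocab_size=4)
uncertain_shape = choose_tree_shape(entropy(uncertain), vocab_size=4)
```

In this toy setup the peaked distribution selects the deep, narrow shape and the flat distribution selects the shallow, wide one. A real system would presumably tune the thresholds and tree budgets per model and verify the drafted tokens with the target VLM.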
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
