SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding

Yujia Tong; Tian Zhang; Yunyang Wan; Kaiwei Lin; Jingling Yuan; Chuang Hu

arXiv:2602.00523·cs.CV·February 3, 2026



TL;DR

SAGE is an entropy-guided speculative decoding framework that adapts the speculation tree structure in real time, speeding up inference in vision-language models without sacrificing output quality.

Contribution

The paper proposes an adaptive speculation tree mechanism that uses output entropy to adjust the tree shape at each decoding step, improving decoding efficiency in vision-language models.

Findings

01. Achieves up to 3.36x speedup on LLaVA-OneVision-72B
02. Achieves up to 3.18x speedup on Qwen2.5-VL-72B
03. Maintains output quality while accelerating inference

Abstract

Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE…
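The entropy-to-tree-shape idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the entropy thresholds (`low`, `high`), and the depth/width values are all hypothetical, chosen only to show the direction of the mapping (low entropy gives a deeper, narrower tree; high entropy gives a shallower, wider one). The sketch also assumes the entropy is computed from the logits of the most recent decoding step, which is one plausible reading of the "temporal correlation" claim.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_tree_shape(logits, low=0.5, high=2.0):
    """Map output entropy to a speculation tree shape.

    The thresholds and the returned depth/width values are hypothetical
    placeholders, not values taken from the paper.
    """
    h = entropy(softmax(logits))
    if h < low:
        # Confident prediction: speculate deeper along fewer branches.
        return {"depth": 8, "width": 2}
    if h > high:
        # Uncertain prediction: explore more candidates at shallow depth.
        return {"depth": 3, "width": 8}
    # Intermediate uncertainty: a balanced default shape.
    return {"depth": 5, "width": 4}


if __name__ == "__main__":
    peaked = [10.0, 0.0, 0.0, 0.0]   # one dominant token
    flat = [0.0] * 16                # uniform over 16 tokens
    print(choose_tree_shape(peaked))
    print(choose_tree_shape(flat))
```

In a real system the thresholds would presumably be tuned per model and the tree would be built by a draft model; the sketch only covers the shape-selection step.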


Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning