HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding

Qitan Lv; Tianyu Liu; Wen Wu; Xuenan Xu; Bowen Zhou; Feng Wu; Chao Zhang

arXiv:2601.08273 · cs.CV · January 14, 2026

TL;DR

HIPPO is a holistic-aware parallel speculative decoding framework that accelerates video large language model (video-LLM) inference by preserving semantic visual tokens during pruning and overlapping decoding phases, achieving up to 3.51x speedup over standard decoding.

Contribution

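HIPPO's semantic-aware token preservation and parallel decoding strategies address two limitations of earlier speculative decoding methods for video-LLMs (poor preservation of visual semantic tokens and the residual cost of the draft model), enabling faster video-LLM inference without quality loss.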

Findings

1. Up to 3.51x speedup over standard decoding.

2. Effective preservation of semantic tokens at high pruning ratios.

3. Validated on four video-LLMs across six benchmarks.
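To make the notion of a pruning ratio concrete, the following is a minimal sketch of generic score-based visual token pruning. It is not HIPPO's semantic-aware strategy, which this page does not describe in detail. The function name `prune_visual_tokens` and the per-token `scores` are hypothetical; in practice the scores might come from attention weights or a similar importance signal.

```python
from typing import List, Sequence


def prune_visual_tokens(scores: Sequence[float], prune_ratio: float) -> List[int]:
    """Return the indices of visual tokens to keep, in their original order.

    Generic top-k pruning (an illustrative assumption, not HIPPO's method):
    keep the highest-scoring tokens so that roughly `prune_ratio` of all
    tokens are removed.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    # Always keep at least one token so the draft model still sees visual input.
    n_keep = max(1, round(len(scores) * (1.0 - prune_ratio)))
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal/spatial order among the kept tokens.
    return sorted(ranked[:n_keep])


# Ten visual tokens with hypothetical importance scores.
example_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.0, 0.4, 0.6]
kept_at_70 = prune_visual_tokens(example_scores, prune_ratio=0.7)
kept_at_90 = prune_visual_tokens(example_scores, prune_ratio=0.9)
```

At a 90% pruning ratio only one of the ten tokens survives, which illustrates why the choice of which tokens to keep matters so much for draft quality.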

Abstract

Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the computational burden of massive visual inputs. However, these methods do not achieve inference acceleration comparable to that obtained for text-only LLMs. We observe from extensive experiments that this gap mainly stems from two limitations: (i) their pruning strategies inadequately preserve visual semantic tokens, degrading draft quality and acceptance rates; (ii) even with aggressive pruning (e.g., 90% of visual tokens removed), the draft model's remaining inference cost limits overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. Specifically, HIPPO proposes (i) a semantic-aware token…
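For readers unfamiliar with speculative decoding, the following is a minimal sketch of the standard sequential draft-then-verify loop under greedy decoding. It is not HIPPO's parallel variant. The names `speculative_decode`, `target_next`, `draft_next`, and `gamma` (the number of drafted tokens per round) are hypothetical, and the models are stand-in functions that map a token sequence to the next token.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]  # greedy next-token function (stand-in for a model)


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[int],
    max_new_tokens: int,
    gamma: int = 4,
) -> Tuple[List[int], float]:
    """Greedy draft-then-verify decoding.

    Returns the generated sequence (identical to plain greedy decoding with
    the target model) and the fraction of drafted tokens that were accepted.
    """
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    accepted = drafted = 0
    while len(tokens) < limit:
        # Draft phase: the cheap model proposes gamma tokens autoregressively.
        draft: List[int] = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        drafted += gamma
        # Verify phase: accept the longest prefix the target agrees with.
        # A real system scores all gamma positions in one target forward pass;
        # this sketch loops for clarity.
        n_accept = 0
        for i in range(gamma):
            if draft[i] != target_next(tokens + draft[:i]):
                break
            n_accept += 1
        accepted += n_accept
        tokens.extend(draft[:n_accept])
        # The target always contributes one token: a correction on mismatch,
        # or a bonus token when every drafted token was accepted.
        tokens.append(target_next(tokens))
    rate = accepted / drafted if drafted else 0.0
    return tokens[:limit], rate


# Toy models: the draft agrees with the target except at some positions.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    return toy_target(seq) if len(seq) % 5 else (toy_target(seq) + 1) % 7


output, acceptance_rate = speculative_decode(toy_target, toy_draft, [2], 12, gamma=4)
```

In this sequential form the target model idles while the draft model runs, and vice versa. The acceptance rate and the draft model's own cost together bound the achievable speedup, which corresponds to the two limitations the abstract identifies.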

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning