Stateful Token Reduction for Long-Video Hybrid VLMs

Jindong Jiang; Amala Sanjay Deshmukh; Kateryna Chumachenko; Karan Sapra; Zhiding Yu; Guilin Liu; Andrew Tao; Pavlo Molchanov; Jan Kautz; Wonmin Byeon

arXiv:2603.00198 · cs.CV · March 3, 2026

TL;DR

This paper introduces a progressive token reduction method for hybrid long-video vision-language models. By scoring token importance the same way in attention and Mamba blocks, it reports a 3.8x to 4.2x speedup at near-baseline accuracy.

Contribution

It proposes a novel low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for hybrid VLMs, enabling effective all-layer token reduction.
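To make the idea of a low-to-high schedule concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the geometric shape of the schedule, the dot-product score against a pooled query vector, and every function name (`keep_counts`, `query_scores`, `prune`, `progressive_reduce`) are hypothetical. The page does not give the paper's actual scoring rule for Mamba blocks, so a simple query-similarity score stands in for it here.

```python
from typing import List, Sequence


def keep_counts(num_tokens: int, num_layers: int, final_keep_ratio: float) -> List[int]:
    """Tokens to keep after each layer: reduce little early, more with depth.

    Assumed shape: geometric interpolation from ratio 1.0 toward final_keep_ratio.
    """
    counts = []
    for layer in range(num_layers):
        frac = (layer + 1) / num_layers
        ratio = final_keep_ratio ** frac
        counts.append(max(1, round(num_tokens * ratio)))
    return counts


def query_scores(tokens: Sequence[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Stand-in language-aware score: dot product with a pooled text-query vector."""
    return [sum(t * q for t, q in zip(tok, query)) for tok in tokens]


def prune(alive: List[int], scores: List[float], keep: int) -> List[int]:
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(alive)), key=lambda i: scores[i], reverse=True)[:keep]
    return [alive[i] for i in sorted(ranked)]


def progressive_reduce(
    tokens: Sequence[Sequence[float]],
    query: Sequence[float],
    num_layers: int,
    final_keep_ratio: float,
) -> List[List[int]]:
    """Return the surviving token indices after each layer.

    Simplification: the same token features are re-scored at every layer.
    In a real model the hidden states would change from layer to layer.
    """
    alive = list(range(len(tokens)))
    history = []
    for keep in keep_counts(len(tokens), num_layers, final_keep_ratio):
        scores = query_scores([tokens[i] for i in alive], query)
        alive = prune(alive, scores, min(keep, len(alive)))
        history.append(list(alive))
    return history


if __name__ == "__main__":
    toy_tokens = [[1, 0], [0, 1], [0.5, 0.5], [0.9, 0.1],
                  [0.1, 0.9], [0.2, 0.2], [0.8, 0.3], [0, 0]]
    toy_query = [1, 0]
    for layer, kept in enumerate(progressive_reduce(toy_tokens, toy_query, 3, 0.25)):
        print(f"layer {layer}: kept token indices {kept}")
```

The point of the schedule is that early layers discard few tokens, so a token whose importance only shows up deeper in the network still has a chance to survive until then.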

Findings

01

Achieves 3.8x to 4.2x speedup with near-baseline accuracy

02

Effective token importance estimation across layers

03

Light finetuning improves long-video benchmark performance

Abstract

Token reduction is an effective way to accelerate long-video vision-language models (VLMs), but most existing methods are designed for dense Transformers and do not directly account for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). We study query-conditioned token reduction for hybrid video VLMs and analyze reduction behavior through two properties: layerwise sparsity (how many tokens capture query-relevant information) and importance stability (whether token-importance rankings persist across depth). Although token importance is sparse within each layer, the set of important tokens changes across layers, so aggressive early pruning is unreliable. Motivated by this, we propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention…
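The two properties named in the abstract, layerwise sparsity and importance stability, can be illustrated with a small sketch. The definitions below are assumptions chosen for illustration, since this page does not state the paper's exact metrics: sparsity is taken as the number of top-scoring tokens needed to cover a fixed share of the total score, and stability as the Jaccard overlap between the top-k token sets of two layers. The function names and the toy scores are hypothetical.

```python
from typing import Sequence, Set


def tokens_for_mass(scores: Sequence[float], mass: float = 0.9) -> int:
    """Smallest number of top-scoring tokens covering `mass` of the total score.

    Assumes non-negative scores. A small result means importance is sparse.
    """
    total = sum(scores)
    running = 0.0
    for count, score in enumerate(sorted(scores, reverse=True), start=1):
        running += score
        if running >= mass * total:
            return count
    return len(scores)


def top_k_set(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def top_k_overlap(scores_a: Sequence[float], scores_b: Sequence[float], k: int) -> float:
    """Jaccard overlap of the top-k token sets of two layers (1.0 means identical)."""
    set_a, set_b = top_k_set(scores_a, k), top_k_set(scores_b, k)
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Toy per-token importance scores for two layers (made-up numbers).
    early_layer = [60, 35, 3, 1, 1]
    late_layer = [1, 3, 60, 35, 1]
    print("tokens for 90% mass, early:", tokens_for_mass(early_layer))
    print("tokens for 90% mass, late:", tokens_for_mass(late_layer))
    print("top-2 overlap across layers:", top_k_overlap(early_layer, late_layer, 2))
```

In this toy case each layer is sparse (two tokens carry most of the score), yet the two layers pick disjoint tokens. That is the situation the abstract describes as making aggressive early pruning unreliable.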

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications