StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding

Xinqi Jin; Hanxun Yu; Bohan Yu; Kebin Liu; Jian Liu; Keda Tao; Yixuan Pei; Huan Wang; Fan Dang; Jiangchuan Liu; Weiqiang Wang

arXiv:2512.12560·cs.CV·December 16, 2025

TL;DR

StreamingAssistant introduces a novel token pruning method for online video understanding that reduces computational load while maintaining high accuracy, enabling more efficient processing of video data with minimal latency.

Contribution

The paper proposes a new redundancy metric, MSSAVT (Maximum Similarity to Spatially Adjacent Video Tokens), and a masked pruning strategy that together prune redundant video tokens, improving efficiency without sacrificing accuracy.

Findings

01. Achieves up to 4% accuracy improvement on benchmarks.

02. Pruning latency is less than 1 ms, enabling real-time processing.

03. Effectively reduces GPU memory usage for online video understanding.

Abstract

Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in high GPU memory usage and computational latency. To address these challenges, we propose token pruning as a means to reduce context length while retaining critical information. Specifically, we introduce a novel redundancy metric, Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT), which accounts for both token similarity and spatial position. To mitigate the bidirectional dependency between pruning and redundancy, we further design a masked pruning strategy that ensures only mutually unadjacent tokens are pruned. We also integrate an existing temporal redundancy-based pruning method to eliminate temporal redundancy of the video modality.…
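The abstract names the metric and the masking idea but does not give their exact definitions, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes cosine similarity as the similarity measure, a 4-connected neighborhood as "spatially adjacent", a fixed threshold for deciding redundancy, and a checkerboard mask as one simple way to guarantee that no two pruned tokens are adjacent. All function names and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def max_sim_to_neighbors(grid, r, c):
    """Assumed redundancy score: max cosine similarity between the token at
    (r, c) and its 4-connected neighbors in the same frame."""
    h, w = len(grid), len(grid[0])
    best = -1.0  # lowest possible cosine value; returned if there are no neighbors
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            best = max(best, cosine(grid[r][c], grid[nr][nc]))
    return best


def masked_prune(grid, threshold=0.9):
    """Return a keep-mask for an h x w grid of token vectors.

    Assumption: only positions with (r + c) even are pruning candidates.
    No two such positions share an edge, so every pruned token still has
    all of its retained neighbors available to represent it.
    """
    h, w = len(grid), len(grid[0])
    keep = [[True] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if (r + c) % 2 == 0 and max_sim_to_neighbors(grid, r, c) >= threshold:
                keep[r][c] = False
    return keep


# Usage: a 2 x 2 frame whose tokens are all identical.
frame = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [1.0, 0.0]]]
mask = masked_prune(frame)
```

In this sketch the mask is what breaks the circular dependency the abstract mentions: a token's score depends on its neighbors, and because neighbors of a candidate are never candidates themselves, pruning one token cannot invalidate the score of another pruned token.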

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis