Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu

TL;DR
This paper introduces a unified spatiotemporal token compression method for Video-LLMs that selects and merges tokens globally, based on attention and semantic relevance, cutting computational cost while largely preserving performance.
Contribution
It proposes a novel unified token selection and merging strategy that operates without retraining, improving efficiency in Video-LLMs at ultra-low token retention levels.
Findings
Retaining about 2% of tokens preserves over 90% of baseline performance.
FLOPs drop to approximately 2.6% of the original.
Inference latency and memory usage decrease across various models.
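To give a feel for what a roughly 2% retention ratio means in practice, the short sketch below works through the arithmetic. The frame count and tokens-per-frame values are hypothetical, chosen only for illustration, and do not come from the paper; `retention_budget` is likewise an illustrative helper name.

```python
def retention_budget(num_frames: int, tokens_per_frame: int, ratio: float):
    """Return (total tokens, retained tokens, average retained tokens per frame)."""
    total = num_frames * tokens_per_frame
    budget = max(1, round(total * ratio))
    return total, budget, budget / num_frames


# Hypothetical input: 32 frames with 196 visual tokens each, 2% retention.
total, budget, per_frame = retention_budget(32, 196, 0.02)
print(total, budget, per_frame)  # 6272 125 3.90625
```

With these assumed numbers, an even per-frame split would leave fewer than four tokens per frame, which suggests why allocating from a single global pool can matter at such low ratios.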
Abstract
Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform…
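The selection-and-merge idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (normalized attention minus a redundancy penalty weighted by a hypothetical `lam`), the greedy loop, and the nearest-neighbor averaging used as a stand-in for clustering are all assumptions made for illustration. Tokens from every frame are assumed to sit in one pool, so the budget is allocated globally rather than per frame.

```python
import numpy as np


def select_and_merge(tokens, attn, keep, lam=0.5):
    """Greedy global selection plus merge (illustrative assumption, not the paper's code).

    tokens: (N, D) visual tokens pooled over all frames.
    attn:   (N,) per-token importance, e.g. attention weights.
    keep:   number of tokens to retain from the global pool.
    lam:    hypothetical weight of the redundancy penalty.
    """
    n = tokens.shape[0]
    keep = max(1, min(keep, n))
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)

    # Start from the most-attended token, then repeatedly add the token with
    # high attention and low cosine similarity to anything already selected.
    selected = [int(np.argmax(a))]
    max_sim = unit @ unit[selected[0]]
    while len(selected) < keep:
        score = a - lam * max_sim
        score[selected] = -np.inf  # never pick the same token twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
    selected = np.array(selected)

    # Merge each unselected token into its most similar selected token by
    # averaging, so its information is folded back in instead of dropped.
    mask = np.ones(n, dtype=bool)
    mask[selected] = False
    rest = np.flatnonzero(mask)
    merged = tokens[selected].astype(np.float64)
    counts = np.ones(len(selected))
    if rest.size:
        owner = np.argmax(unit[rest] @ unit[selected].T, axis=1)
        np.add.at(merged, owner, tokens[rest])
        np.add.at(counts, owner, 1.0)
    merged /= counts[:, None]
    return selected, merged


rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 16))   # stand-in for pooled video tokens
weights = rng.random(200)           # stand-in for attention weights
idx, compact = select_and_merge(pool, weights, keep=4)
print(idx.shape, compact.shape)     # (4,) (4, 16)
```

The sketch covers only the pre-LLM selection and merging step; the text-aware merging inside the LLM mentioned in the abstract is not modeled here.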
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
