LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou

TL;DR
LLaVA-Scissor is a training-free token compression method for video LLMs built on semantic connected components. It reduces the number of video tokens while preserving semantic coverage, and the authors report improved performance across video understanding tasks.
Contribution
It proposes a novel, training-free token compression strategy using semantic connected components for better semantic coverage in video LLMs.
Findings
Outperforms existing token compression methods on video understanding benchmarks.
Maintains high performance at low token retention ratios.
Effective in diverse video understanding tasks.
Abstract
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. In contrast, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including…
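To make the idea in the abstract concrete, the following is a minimal sketch of connected-components token compression, not the authors' implementation. It assumes (these details are not stated in the abstract) that two tokens are linked when their cosine similarity exceeds a threshold `tau`, that each component is represented by the mean of its tokens, and that the two-step strategy applies the same routine first within each frame and then across the pooled frame-level tokens. The function names `connected_components`, `scc_compress`, and `compress_video` are hypothetical.

```python
import numpy as np


def connected_components(adj):
    """Label connected components of a boolean adjacency matrix via union-find."""
    n = adj.shape[0]
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union the endpoints of every edge in the upper triangle.
    for i, j in zip(*np.nonzero(np.triu(adj, k=1))):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            parent[rj] = ri

    roots = [find(i) for i in range(n)]
    # Map root ids to consecutive labels 0..k-1.
    _, labels = np.unique(roots, return_inverse=True)
    return labels


def scc_compress(tokens, tau=0.8):
    """Merge tokens whose similarity graph connects them (assumed rule: cosine > tau)."""
    norms = np.maximum(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-12)
    normed = tokens / norms
    sim = normed @ normed.T
    labels = connected_components(sim > tau)
    k = int(labels.max()) + 1
    # Assumption: one representative per component, taken as the component mean.
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(k)])


def compress_video(frames, tau=0.8):
    """Two-step sketch: spatial compression per frame, then temporal across frames."""
    per_frame = [scc_compress(f, tau) for f in frames]
    return scc_compress(np.concatenate(per_frame, axis=0), tau)


# Toy usage: two frames, each with two clearly separated token groups.
frame = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
video_tokens = compress_video([frame, frame.copy()], tau=0.8)
```

In this toy case the eight input tokens collapse to two, one per semantic group. Because connectivity is transitive, a low `tau` can chain loosely related tokens into one large component, so the threshold directly controls the retention ratio in this sketch.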
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Sparse Evolutionary Training
