Clapper: Compact Learning and Video Representation in VLMs
Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang

TL;DR
Clapper introduces a novel slow-fast video representation method with TimePerceiver for efficient temporal-spatial encoding, enabling high-performance video understanding with significantly reduced visual tokens.
Contribution
The paper proposes Clapper, a new approach combining slow-fast strategies and TimePerceiver to improve long and short video modeling in VLMs with substantial token compression.
Findings
Achieves 13x token compression without accuracy loss
Outperforms existing models on VideoMME, MLVU, TempCompass benchmarks
Uses fewer than 6,000 visual tokens per video
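To make the token-budget figures above concrete, the sketch below counts visual tokens under a generic slow-fast split, where a few "slow" frames keep all their tokens and the remaining "fast" frames are reduced to a handful each. It is a rough illustration, not Clapper's actual configuration: the function names, the frame count, the tokens per frame, and the slow/fast sampling rule are all hypothetical, and TimePerceiver itself is not modeled.

```python
def slow_fast_token_count(num_frames, tokens_per_frame, slow_every, fast_tokens_per_frame):
    """Count visual tokens when every `slow_every`-th frame keeps full detail
    and all other frames are reduced to `fast_tokens_per_frame` tokens.

    All parameters are illustrative assumptions, not values from the paper.
    """
    if slow_every < 1:
        raise ValueError("slow_every must be at least 1")
    # Slow frames are those at indices 0, slow_every, 2 * slow_every, ...
    num_slow = (num_frames + slow_every - 1) // slow_every
    num_fast = num_frames - num_slow
    return num_slow * tokens_per_frame + num_fast * fast_tokens_per_frame


def compression_ratio(num_frames, tokens_per_frame, compressed_tokens):
    """Ratio of the uncompressed token count to the compressed token count."""
    return (num_frames * tokens_per_frame) / compressed_tokens


# Hypothetical setup: 128 sampled frames, 196 tokens per uncompressed frame,
# one slow frame in every 16, and 4 tokens kept for each fast frame.
frames, full_tokens = 128, 196
compressed = slow_fast_token_count(frames, full_tokens, slow_every=16, fast_tokens_per_frame=4)
ratio = compression_ratio(frames, full_tokens, compressed)
print(f"{compressed} tokens, {ratio:.2f}x compression")
```

With these made-up settings the budget stays well under 6,000 tokens and the ratio lands in the same order of magnitude as the reported 13x, which shows how aggressively the fast frames must be compressed to reach such figures.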
Abstract
Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation on long video understanding tasks when visual tokens are compressed below a quarter of the original count. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video…
Taxonomy
Topics: Neural Networks and Reservoir Computing
