Clapper: Compact Learning and Video Representation in VLMs

Lingyu Kong; Hongzhi Zhang; Jingyuan Zhang; Jianzhao Huang; Kunze Li; Qi Wang; Fuzheng Zhang

arXiv:2505.15529·cs.CV·May 22, 2025

TL;DR

Clapper introduces a novel slow-fast video representation method with TimePerceiver for efficient temporal-spatial encoding, enabling high-performance video understanding with significantly reduced visual tokens.

Contribution

The paper proposes Clapper, a new approach combining slow-fast strategies and TimePerceiver to improve long and short video modeling in VLMs with substantial token compression.

Findings

01. Achieves 13x token compression without accuracy loss

02. Outperforms existing models on the VideoMME, MLVU, and TempCompass benchmarks

03. Uses fewer than 6,000 visual tokens per video
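
To make the token arithmetic behind these findings concrete, here is a rough sketch of how a 13x compression ratio changes the number of frames that fit under a 6,000-token budget. The 13x ratio and the 6,000-token budget come from the findings above; the baseline of 780 tokens per frame and both function names are hypothetical, chosen only for illustration, and this is not the paper's implementation.

```python
def compressed_tokens_per_frame(base_tokens_per_frame: int, compression_ratio: float) -> float:
    """Tokens per frame after dividing the baseline count by the compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return base_tokens_per_frame / compression_ratio


def max_frames_within_budget(tokens_per_frame: float, token_budget: int) -> int:
    """Largest whole number of frames whose tokens fit inside the budget."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return int(token_budget // tokens_per_frame)


# Hypothetical baseline: 780 visual tokens per uncompressed frame (an assumption).
BASE_TOKENS_PER_FRAME = 780
COMPRESSION_RATIO = 13   # from the findings above
TOKEN_BUDGET = 6000      # from the findings above

per_frame = compressed_tokens_per_frame(BASE_TOKENS_PER_FRAME, COMPRESSION_RATIO)
frames_compressed = max_frames_within_budget(per_frame, TOKEN_BUDGET)
frames_uncompressed = max_frames_within_budget(BASE_TOKENS_PER_FRAME, TOKEN_BUDGET)

print(per_frame)            # 60.0
print(frames_compressed)    # 100
print(frames_uncompressed)  # 7
```

Under these assumed numbers, the same budget covers 100 frames instead of 7, which is why per-frame compression matters most for long videos.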

Abstract

Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation on long video understanding tasks when visual tokens are compressed to below a quarter of their original count. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video…

Taxonomy

Topics: Neural Networks and Reservoir Computing