TL;DR
This paper introduces XComp, a novel extreme compression method for long video understanding that combines token-level and frame-level strategies, significantly improving efficiency and accuracy in vision-language models.
Contribution
It proposes learnable, progressive token compression modules (LP-Comp) and question-conditioned frame selection, which together enable denser frame sampling with little finetuning data and outperform previous heuristic compression methods.
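As a rough illustration of what question-conditioned frame selection can look like, the sketch below scores each frame embedding against a question embedding and keeps the top-k frames in temporal order. This is an assumption-laden toy, not the paper's method: the cosine-similarity scoring, the function names `cosine` and `select_frames`, and the 2-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(frame_embs, question_emb, k):
    """Return indices of the k frames most similar to the question.

    Hypothetical scoring rule (cosine similarity); the indices are returned
    in ascending order so the selected frames keep their temporal order.
    """
    scores = [cosine(f, question_emb) for f in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: four frames with made-up 2-D embeddings and one question embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
question = [1.0, 0.1]
selected = select_frames(frames, question, k=2)
print(selected)
```

Returning the indices in ascending order matters for video: the downstream model still sees the chosen frames in the order they occurred.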
Findings
XComp lets the VLM process 2x-4x more frames while improving performance.
Finetuning on only 2.5% of data boosts accuracy from 42.9% to 46.2%.
XComp outperforms previous methods on LVBench and other benchmarks.
Abstract
Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most…
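The context-length argument in the abstract can be made concrete with some simple arithmetic. The sketch below is a simplification under stated assumptions: it treats the context as one flat token budget, whereas the abstract describes compression that proceeds progressively across LLM layers. The context size, prompt length, and tokens-per-frame values are hypothetical and are not taken from the paper.

```python
def max_frames(context_tokens, text_tokens, tokens_per_frame):
    """Number of whole frames that fit in the context after reserving text tokens."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    budget = context_tokens - text_tokens
    return max(0, budget // tokens_per_frame)


# Hypothetical numbers: an 8192-token context with 192 tokens reserved for text.
CONTEXT = 8192
TEXT = 192

baseline = max_frames(CONTEXT, TEXT, tokens_per_frame=16)
halved = max_frames(CONTEXT, TEXT, tokens_per_frame=8)
quartered = max_frames(CONTEXT, TEXT, tokens_per_frame=4)
one_token = max_frames(CONTEXT, TEXT, tokens_per_frame=1)

print(baseline, halved, quartered, one_token)
```

Under this flat-budget view, halving or quartering the tokens per frame doubles or quadruples the number of frames that fit, which is the kind of trade the abstract refers to when it mentions digesting 2x-4x more frames.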