One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Zheyu Zhang; Ziqi Pang; Shixing Chen; Xiang Hao; Vimal Bhat; Yu-Xiong Wang

arXiv:2604.14149 · cs.CV · April 17, 2026

TL;DR

This paper introduces XComp, an extreme compression method for long video understanding that combines token-level and frame-level compression to improve both the efficiency and the accuracy of vision-language models (VLMs).

Contribution

The paper proposes learnable, progressive token compression modules (LP-Comp) together with question-conditioned frame selection. These allow denser frame sampling, need little finetuning data, and outperform previous heuristic-based compression methods.
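
As a rough illustration of what question-conditioned frame selection can look like, the sketch below scores each frame embedding against a question embedding by cosine similarity and keeps the top-k frames in temporal order. The scoring rule, the names `cosine` and `select_frames`, and the toy embeddings are all assumptions made for illustration; the paper's actual selection mechanism may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(
    frame_embs: Sequence[Sequence[float]],
    question_emb: Sequence[float],
    k: int,
) -> List[int]:
    """Return indices of the k frames most similar to the question.

    Hypothetical stand-in for question-conditioned selection: rank frames
    by cosine similarity, keep the top k, then restore temporal order.
    """
    scores = [cosine(f, question_emb) for f in frame_embs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy 2-D embeddings, invented for this example only.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
question = [1.0, 0.0]
selected = select_frames(frames, question, k=2)
print(selected)
```

Returning the indices in temporal order is a design choice of this sketch: it keeps the selected frames in the order they occur in the video rather than in order of relevance.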

Findings

01. XComp processes 2x-4x more frames while improving performance.

02. Finetuning on only 2.5% of data boosts accuracy from 42.9% to 46.2%.

03. XComp outperforms previous methods on LVBench and other benchmarks.

Abstract

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most…
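
To make the context-length pressure concrete, the following back-of-the-envelope sketch counts how many frames fit in a fixed context at different per-frame token costs. The context size, the text reservation, and the per-frame token counts are hypothetical values, not figures from the paper. The calculation also treats the per-frame cost as constant, whereas the abstract describes compression that progresses across LLM layers.

```python
def max_frames(context_tokens: int, text_tokens: int, tokens_per_frame: int) -> int:
    """Whole frames that fit in the context after reserving room for text."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    visual_budget = max(context_tokens - text_tokens, 0)
    return visual_budget // tokens_per_frame


# Hypothetical budget: a 32,768-token context with 768 tokens kept for text.
CONTEXT_TOKENS = 32_768
TEXT_TOKENS = 768

for tokens_per_frame in (196, 16, 1):
    frames = max_frames(CONTEXT_TOKENS, TEXT_TOKENS, tokens_per_frame)
    print(f"{tokens_per_frame:>3} tokens/frame -> {frames} frames")
```

Under these assumed numbers, 196 tokens per frame leaves room for 163 frames, 16 tokens per frame for 2,000, and one token per frame for 32,000, which is why per-frame token cost dominates how densely a long video can be sampled.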
