PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Chenyu Yang; Xuan Dong; Xizhou Zhu; Weijie Su; Jiahao Wang; Hao Tian; Zhe Chen; Wenhai Wang; Lewei Lu; Jifeng Dai

arXiv:2412.09613·cs.CV·December 13, 2024

Open Access

TL;DR

This paper introduces PVC, a unified token compression method for large vision-language models that processes both images and videos by progressively encoding and adaptively compressing the visual tokens of each frame. It reports state-of-the-art results on video benchmarks while maintaining high performance on image benchmarks.

Contribution

The paper proposes a novel unified token compression strategy called PVC that handles images and videos simultaneously, enhancing model versatility and efficiency.

Findings

01 Achieves state-of-the-art performance on various video benchmarks.

02 Maintains high performance on image benchmarks, especially in detail-sensitive tasks.

03 Compresses visual tokens efficiently, using only a limited number of tokens per frame.

Abstract

Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, which limits their ability to combine images and videos. To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are efficiently compressed by exploiting the inherent temporal redundancy. Images are repeated as static videos, and the spatial details can be gradually supplemented in…
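The "static video" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `as_static_video` and `visual_token_count`, the repeat count of 4, and the budget of 64 tokens per frame are all hypothetical values chosen for illustration. The sketch only shows how repeating an image gives it the same frame-sequence shape as a video, and how a fixed per-frame token budget determines the total visual token length.

```python
from typing import Any, List


def as_static_video(image: Any, num_repeats: int = 4) -> List[Any]:
    """Repeat a single image so it can be handled like a video.

    num_repeats is a hypothetical setting, not a value taken from the source.
    """
    if num_repeats < 1:
        raise ValueError("num_repeats must be at least 1")
    return [image for _ in range(num_repeats)]


def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens when every frame is compressed to a fixed budget."""
    return num_frames * tokens_per_frame


# A toy 2x2 "image"; a real model would use a pixel tensor instead.
image = [[0, 1], [2, 3]]
frames = as_static_video(image, num_repeats=4)

# With an assumed budget of 64 tokens per frame, the image costs 4 * 64 tokens.
total_tokens = visual_token_count(len(frames), tokens_per_frame=64)
print(len(frames), total_tokens)
```

Under this framing, an image and a video share one input path, so a single compression strategy can be applied to both; the progressive encoding and adaptive compression steps themselves are not modeled here.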

Code & Models

Models

🤗 OpenGVLab/PVC-InternVL2-8B
model · 23 dl · ♡ 9

Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Advanced Image and Video Retrieval Techniques