VisionZip: Longer is Better but Not Necessary in Vision Language Models

Senqiao Yang; Yukang Chen; Zhuotao Tian; Chengyao Wang; Jingyao Li; Bei Yu; Jiaya Jia

arXiv:2412.04467 · cs.CV · March 17, 2026

TL;DR

VisionZip selects a small set of informative visual tokens to feed the language model, cutting the redundancy of overly long visual representations and speeding up inference while maintaining model performance.

Contribution

Introduces VisionZip, a simple method that selects informative visual tokens as input to the language model, reducing redundancy and improving efficiency. It applies to image and video understanding tasks and to multi-turn dialogue.

Findings

01 VisionZip outperforms the previous state-of-the-art method by at least 5%.

02 Increases inference speed by up to 8x.

03 Enables faster inference for larger models with better results.

Abstract

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance…
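
The token-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how tokens are judged informative, so the scoring here is an assumption: each visual token is taken to carry a precomputed importance score (for example, the attention it receives inside the vision encoder), and the highest-scoring tokens are kept in their original order. The function name `select_informative_tokens` and its parameters are hypothetical and are not taken from the VisionZip code.

```python
from typing import List, Sequence, Tuple


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep the `keep` highest-scoring tokens, preserving original order.

    Assumption: `scores` holds one importance value per token. How the
    scores are produced is outside this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    keep = min(keep, len(tokens))

    # Rank token indices from highest to lowest score. Python's sort is
    # stable, so tied tokens stay in their original relative order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Restore positional order so the language model sees the retained
    # tokens in the same sequence the vision encoder produced them.
    kept_idx = sorted(ranked[:keep])
    return [tokens[i] for i in kept_idx], kept_idx


# Toy input: six 4-dimensional "visual tokens" with made-up scores.
tokens = [[float(i)] * 4 for i in range(6)]
scores = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]

kept, kept_idx = select_informative_tokens(tokens, scores, keep=2)
print(kept_idx)
print(len(kept), "of", len(tokens), "tokens passed to the language model")
```

Only the retained tokens would be passed on to the language model, so the visual part of its input sequence shrinks from `len(tokens)` to `keep`. This is the source of the efficiency gain the abstract describes.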

Code & Models

Repositories

dvlab-research/visionzip
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training · Focus