VisionZip: Longer is Better but Not Necessary in Vision Language Models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia

TL;DR
VisionZip cuts redundancy in the visual tokens of vision-language models by passing only a selected set of informative tokens to the language model. This addresses the cost of overly long visual representations, speeding up inference while maintaining model performance.
Contribution
Introduces VisionZip, a method that selects informative visual tokens to reduce redundancy and improve efficiency in vision-language models. It applies to image and video understanding tasks and to multi-turn dialogue.
Findings
Outperforms the previous state-of-the-art method by at least 5%.
Increases inference speed by up to 8x.
Enables faster inference for larger models with better results.
Abstract
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance…
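The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each visual token already has a per-token informativeness score; the abstract does not say how that score is computed, so `token_scores`, the function name `select_informative_tokens`, and the `keep` parameter are hypothetical and are not the authors' implementation. The sketch only shows the general idea of keeping a small subset of tokens before they reach the language model.

```python
from typing import List, Sequence


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[int]:
    """Return indices of the `keep` highest-scoring tokens, in original order.

    `scores` is a hypothetical per-token informativeness score; how it is
    computed is an assumption here, not something the abstract specifies.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 <= keep <= len(tokens):
        raise ValueError("keep must be between 0 and the number of tokens")
    # Rank token indices by score, highest first; ties keep the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore the original ordering of the retained tokens.
    return sorted(ranked[:keep])


# Toy example: 6 visual tokens with 2-dimensional features (made-up values).
visual_tokens = [[0.1, 0.2], [0.9, 0.8], [0.1, 0.2], [0.5, 0.4], [0.1, 0.2], [0.7, 0.6]]
token_scores = [0.05, 0.40, 0.05, 0.20, 0.05, 0.25]

kept = select_informative_tokens(visual_tokens, token_scores, keep=3)
reduced = [visual_tokens[i] for i in kept]
print(kept)          # indices of the retained tokens
print(len(reduced))  # number of tokens passed on to the language model
```

In this toy input, three of the six tokens are identical and low-scoring, so dropping them halves the sequence length. The language model then processes fewer visual tokens, which is where the efficiency gain in the abstract comes from.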
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training · Focus
