PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing; Qidong Huang; Xiaoyi Dong; Jiajie Lu; Pan Zhang; Yuhang Zang; Yuhang Cao; Conghui He; Jiaqi Wang; Feng Wu; Dahua Lin

arXiv:2410.17247·cs.CV·February 28, 2025

TL;DR

PyramidDrop reduces visual redundancy in large vision-language models (LVLMs) by dropping image tokens stage by stage across layers, which speeds up both training and inference with minimal performance loss.

Contribution

The paper introduces PyramidDrop, a stage-wise token-dropping strategy that exploits visual redundancy to make LVLMs more efficient while keeping performance comparable.
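
The stage-wise dropping idea above can be illustrated with a small sketch. This is not the authors' implementation: the fixed per-stage `keep_ratio`, the number of stages, the 576-token starting count, and the per-token importance `scores` are all assumptions chosen for illustration, since this summary does not specify how tokens are ranked or how many are dropped.

```python
from typing import List, Sequence


def pyramid_token_counts(num_image_tokens: int, num_stages: int, keep_ratio: float) -> List[int]:
    """Image tokens processed in each stage when a fixed fraction is kept after every stage.

    Hypothetical helper: the fixed keep_ratio schedule is an assumption.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    counts = []
    remaining = num_image_tokens
    for _ in range(num_stages):
        counts.append(remaining)
        # Shrink the token budget for the next stage, never below one token.
        remaining = max(1, int(remaining * keep_ratio))
    return counts


def drop_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order.

    Hypothetical helper: `scores` stands in for whatever importance measure is used.
    """
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# Illustrative values only: 576 image tokens, 4 stages, half kept per stage.
print(pyramid_token_counts(num_image_tokens=576, num_stages=4, keep_ratio=0.5))
# [576, 288, 144, 72]
print(drop_tokens([0.1, 0.9, 0.3, 0.7], keep_ratio=0.5))
# [1, 3]
```

Under these assumed values, later stages see progressively fewer image tokens, which gives the pyramid shape in the method's name: all tokens are kept in the shallow layers and the count shrinks in deeper ones.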

Findings

01. Achieves 40% faster training and 55% reduced inference FLOPs on LLaVA-NeXT.

02. Maintains comparable performance with significant computational savings.

03. Serves as a plug-and-play inference acceleration method.

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information, ultimately diminishing model performance. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and token redundancy progressively increases in the deeper layers of…

Code & Models

Repositories

cooperx521/pyramiddrop (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques