PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

Qingqing Cao; Bhargavi Paranjape; Hannaneh Hajishirzi

arXiv:2305.17530 · cs.CV · May 30, 2023 · 1 citation

TL;DR

PuMer is a framework that reduces tokens in vision language models through pruning and merging, significantly speeding up inference and decreasing memory use with minimal accuracy loss.

Contribution

PuMer introduces a token reduction method that combines text-informed pruning with modality-aware merging to improve the efficiency of vision language models.

Findings

1. Inference throughput increased by up to 2x.
2. Memory footprint reduced by over 50%.
3. Accuracy drop of less than 1%.

Abstract

Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive due to the quadratic complexity of processing the input image and text. We present PuMer: a token reduction framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the tokens of input image and text, improving model inference speed and reducing memory footprint. PuMer learns to keep salient image tokens related to the input text and merges similar textual and visual tokens by adding lightweight token reducer modules at several cross-modal layers in the VL model. Training PuMer is mostly the same as finetuning the original VL model but faster. Our evaluation for two vision language models on four downstream VL tasks shows PuMer…
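
The two reduction steps named in the abstract can be illustrated with a toy sketch. This is not the PuMer implementation: the abstract says the token reducers are learned modules, whereas the sketch below substitutes a fixed cosine-similarity score for pruning and a greedy pairwise average for merging. The function names, the `keep_ratio` and `n_merges` parameters, and the reading of "modality-aware" as "merge only within one modality" are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_image_tokens(image_tokens, text_tokens, keep_ratio):
    """Keep the image tokens most related to the text.

    Assumption: relevance is the max cosine similarity to any text token.
    This is a stand-in for the learned, text-informed scoring in the paper.
    """
    scores = [max(cosine(img, txt) for txt in text_tokens) for img in image_tokens]
    n_keep = max(1, math.ceil(keep_ratio * len(image_tokens)))
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original token order
    return [image_tokens[i] for i in kept]


def merge_most_similar(tokens, n_merges):
    """Greedily average the most similar pair of tokens, n_merges times.

    Assumption: called separately on image tokens and on text tokens, so
    tokens from different modalities are never merged with each other.
    """
    tokens = [list(t) for t in tokens]
    for _ in range(n_merges):
        if len(tokens) < 2:
            break
        pairs = ((i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens)))
        i, j = max(pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
        tokens[i] = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
        del tokens[j]
    return tokens


def attention_pairs(n_image, n_text):
    """Number of token pairs a full self-attention layer scores (quadratic in length)."""
    return (n_image + n_text) ** 2


# Toy 2-d "embeddings": one text token and four image tokens.
text = [[1.0, 0.0]]
image = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

pruned = prune_image_tokens(image, text, keep_ratio=0.5)  # keeps tokens 0 and 2
merged = merge_most_similar(pruned, n_merges=1)           # averages them into one

print(len(image), "->", len(pruned), "->", len(merged))
print(attention_pairs(len(image), len(text)), "->", attention_pairs(len(merged), len(text)))
```

Because attention cost grows with the square of the sequence length, even this toy reduction from five tokens to two cuts the number of scored pairs from 25 to 4, which is the intuition behind applying the reduction progressively at several cross-modal layers.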

Code & Models

Repositories

csarron/pumer (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Pruning