PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi

TL;DR
PuMer is a framework that reduces tokens in vision language models through pruning and merging, significantly speeding up inference and decreasing memory use with minimal accuracy loss.
Contribution
PuMer introduces a token reduction method that combines text-informed pruning with modality-aware merging to improve the efficiency of vision language models.
Findings
Inference throughput increased by up to 2x.
Memory footprint reduced by over 50%.
Accuracy drop less than 1%.
Abstract
Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive due to the quadratic complexity of processing the input image and text. We present PuMer: a token reduction framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the tokens of input image and text, improving model inference speed and reducing memory footprint. PuMer learns to keep salient image tokens related to the input text and merges similar textual and visual tokens by adding lightweight token reducer modules at several cross-modal layers in the VL model. Training PuMer is mostly the same as finetuning the original VL model but faster. Our evaluation for two vision language models on four downstream VL tasks shows PuMer…
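The two strategies named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract describes learned, lightweight token reducer modules inside the cross-modal layers, whereas the code below swaps in a fixed cosine-similarity heuristic purely for illustration. All function names (`prune_image_tokens`, `merge_most_similar`, `reduce_tokens`) and parameters (`keep_ratio`, `image_merges`, `text_merges`) are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def prune_image_tokens(image_tokens, text_tokens, keep_ratio):
    """Text-informed pruning (heuristic stand-in): keep the image tokens
    whose best cosine similarity to any text token is highest."""
    scores = [max(cosine(img, txt) for txt in text_tokens) for img in image_tokens]
    k = max(1, math.ceil(keep_ratio * len(image_tokens)))
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore original token order
    return [image_tokens[i] for i in keep]


def merge_most_similar(tokens, num_merges):
    """Modality-aware merging (heuristic stand-in): within ONE modality,
    repeatedly average the most similar pair of tokens."""
    tokens = [list(t) for t in tokens]
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        pairs = ((i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens)))
        i, j = max(pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
        tokens[i] = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
        del tokens[j]
    return tokens


def reduce_tokens(image_tokens, text_tokens, keep_ratio, image_merges, text_merges):
    # Prune image tokens using the text, then merge each modality separately
    # so that image and text tokens are never averaged together.
    kept = prune_image_tokens(image_tokens, text_tokens, keep_ratio)
    return merge_most_similar(kept, image_merges), merge_most_similar(text_tokens, text_merges)


image_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text_tokens = [[1.0, 0.0], [1.0, 0.1]]
img_out, txt_out = reduce_tokens(image_tokens, text_tokens, 0.5, 1, 1)
print(len(image_tokens) + len(text_tokens), "->", len(img_out) + len(txt_out))
```

Because self-attention compares every token with every other token, halving the sequence length cuts the number of pairwise comparisons to roughly a quarter, which is why applying a reduction step at several layers can compound into the throughput and memory gains the abstract reports.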
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
