CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang

TL;DR
CoViPAL introduces a layer-wise, context-aware token pruning method for large vision-language models, significantly reducing computational costs while maintaining high accuracy through a lightweight, model-agnostic module.
Contribution
It proposes a novel Plug-and-Play Pruning Module that effectively prunes redundant visual tokens across layers, outperforming existing methods without additional training.
Findings
Outperforms training-free pruning methods at the same token budget
Surpasses training-based methods with similar supervision
Enhances inference efficiency without accuracy loss
Abstract
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
