CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

Zicong Tang; Ziyang Ma; Suqing Wang; Zuchao Li; Lefei Zhang; Hai Zhao; Yun Li; Qianren Wang

arXiv:2508.17243 · cs.CV · September 3, 2025

TL;DR

CoViPAL is a layer-wise, context-aware visual token pruning method for large vision-language models. It uses a lightweight, model-agnostic module to remove redundant visual tokens, significantly reducing computational cost while maintaining high accuracy.

Contribution

It proposes a Plug-and-Play Pruning Module (PPM) that prunes redundant visual tokens across layers, outperforming training-free pruning methods at the same token budget and training-based methods with similar supervision.

Findings

01. Outperforms training-free pruning methods at the same token budget

02. Surpasses training-based methods with similar supervision

03. Enhances inference efficiency without accuracy loss

Abstract

Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove…
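The general idea of score-based visual token pruning described in the abstract can be illustrated with a minimal sketch. This is not the paper's PPM: the function names (`prune_vision_tokens`, `toy_score`), the `keep_ratio` parameter, and the squared-norm scoring rule are all assumptions made for illustration. A learned module would take the place of `toy_score`, and in a layer-wise scheme the selection step could be repeated at several layers.

```python
from typing import Callable, List, Sequence

Token = Sequence[float]  # one vision token embedding (hypothetical representation)


def prune_vision_tokens(
    vision_tokens: List[Token],
    score_fn: Callable[[Token], float],
    keep_ratio: float,
) -> List[int]:
    """Return the indices of the vision tokens to keep, in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token so later layers still see visual input.
    n_keep = max(1, round(len(vision_tokens) * keep_ratio))
    scores = [score_fn(tok) for tok in vision_tokens]
    # Rank by descending score; ties are broken by the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore the original order so positional structure is preserved.
    return sorted(ranked[:n_keep])


def toy_score(token: Token) -> float:
    # Hypothetical stand-in for a learned importance predictor:
    # the squared L2 norm of the embedding.
    return sum(x * x for x in token)


tokens = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.3], [1.0, 1.0]]
kept = prune_vision_tokens(tokens, toy_score, keep_ratio=0.5)
print(kept)
```

Returning indices rather than the pruned tokens themselves lets the caller apply the same selection to hidden states, position ids, and cached keys and values.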

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning