ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning

Wen Luo; Peng Chen; Xiaotao Huang; LiQun Huang

arXiv:2601.17818·cs.CV·January 27, 2026

TL;DR

ViTCoP introduces a novel collaborative pruning framework that combines visual and textual semantic filtering to efficiently reduce redundancy in large vision-language models, leading to faster inference and lower memory usage.

Contribution

The paper proposes a new pruning method that jointly optimizes visual and textual token selection, improving efficiency without sacrificing performance in large vision-language models.

Findings

01. Achieves state-of-the-art results on image and video tasks.

02. Reduces inference latency and GPU memory consumption significantly.

03. Maintains high accuracy even under extreme pruning rates.

Abstract

Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing methods are generally limited, either losing critical visual information prematurely due to pruning in the vision encoder, or leading to information redundancy among the selected tokens due to pruning in the Large Language Models (LLMs). To address these challenges, we propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the LLM based on its hierarchical characteristics, to efficiently preserve critical and informationally diverse visual tokens. Meanwhile, to ensure compatibility with acceleration techniques like FlashAttention, we…
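
The abstract is truncated and gives no algorithmic detail, so the following is only a minimal, generic sketch of two-stage visual token pruning, not ViTCoP's actual procedure. It assumes a hypothetical setup in which visual tokens and a pooled text query are plain feature vectors, that redundancy is measured by cosine similarity, and that text relevance is a simple similarity ranking; the function names, the `sim_threshold` value, and the `keep` budget are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_redundant(tokens, sim_threshold=0.95):
    """Stage 1 (assumed): greedily keep a token only if it is not a
    near-duplicate of a token that was already kept. Returns indices."""
    kept = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept

def prune_by_text(tokens, candidates, text_query, keep):
    """Stage 2 (assumed): rank surviving tokens by similarity to a text
    query vector and keep the top `keep`, restored to original order."""
    ranked = sorted(candidates,
                    key=lambda i: cosine(tokens[i], text_query),
                    reverse=True)
    return sorted(ranked[:keep])

def two_stage_prune(tokens, text_query, sim_threshold=0.95, keep=2):
    """Run the redundancy filter, then the text-guided selection."""
    candidates = filter_redundant(tokens, sim_threshold)
    return prune_by_text(tokens, candidates, text_query, keep)

# Toy data: token 1 nearly duplicates token 0; token 3 is unrelated to the text.
visual_tokens = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
text_query = [1.0, 1.0, 0.0]
kept = two_stage_prune(visual_tokens, text_query, keep=2)
print(kept)
```

In this toy run the first stage drops the near-duplicate token and the second stage drops the token unrelated to the text query. A real system would operate on encoder and LLM hidden states inside the model rather than on hand-written vectors.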

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications