ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning
Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang

TL;DR
ViTCoP introduces a novel collaborative pruning framework that combines visual and textual semantic filtering to efficiently reduce redundancy in large vision-language models, leading to faster inference and lower memory usage.
Contribution
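The paper proposes a pruning method that selects visual tokens using both visual and textual semantics: redundancy filtering in the vision encoder is combined with step-wise co-pruning inside the LLM, with the aim of improving efficiency in large vision-language models while preserving accuracy.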
Findings
Achieves state-of-the-art results on image and video tasks.
Reduces inference latency and GPU memory consumption significantly.
Maintains high accuracy even under extreme pruning rates.
Abstract
Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing methods are generally limited, either losing critical visual information prematurely due to pruning in the vision encoder, or leading to information redundancy among the selected tokens due to pruning in the Large Language Models (LLMs). To address these challenges, we propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the LLM based on its hierarchical characteristics, to efficiently preserve critical and informationally diverse visual tokens. Meanwhile, to ensure compatibility with acceleration techniques like FlashAttention, we…
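The two-stage idea in the abstract (filter redundant visual tokens first, then prune further with guidance from the text) can be illustrated with a small toy sketch. This is not the authors' algorithm: the abstract does not specify the scoring functions, so the cosine-similarity redundancy test, the `threshold` value, the text-similarity ranking, and the function names `filter_redundant` and `prune_by_text` are all assumptions made for illustration. The sketch also performs a single text-guided step, whereas the paper describes step-wise pruning across LLM layers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def filter_redundant(tokens, threshold=0.95):
    """Stage 1 (assumed criterion): return indices of tokens that are not
    near-duplicates (cosine >= threshold) of an already-kept token."""
    kept = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def prune_by_text(tokens, candidates, text_vec, keep):
    """Stage 2 (assumed criterion): among the candidate indices, keep the
    `keep` tokens most similar to the text vector, in original order."""
    ranked = sorted(candidates, key=lambda i: cosine(tokens[i], text_vec), reverse=True)
    return sorted(ranked[:keep])


# Hypothetical toy embeddings; real visual tokens have hundreds of dimensions.
visual_tokens = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],  # near-duplicate of token 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
text_vec = [0.0, 1.0, 0.2]  # hypothetical pooled text embedding

survivors = filter_redundant(visual_tokens)
selected = prune_by_text(visual_tokens, survivors, text_vec, keep=2)
print(survivors, selected)
```

On this toy input, stage 1 drops token 1 as a near-duplicate of token 0, and stage 2 keeps tokens 2 and 4, the two survivors most aligned with the text vector. Running the redundancy filter first means the text-guided step does not spend its budget on several copies of the same content, which is the kind of redundancy among selected tokens that the abstract attributes to pruning only inside the LLM.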
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
