TL;DR
This paper introduces CoMP (Collaborative Multi-Mode Pruning), a framework for vision-language models that prunes parameters and tokens jointly. By exploiting redundancy in both, it improves performance at high pruning ratios.
Contribution
The paper proposes a novel collaborative importance metric and multi-mode pruning strategy that jointly prunes parameters and tokens in VLMs, outperforming existing methods.
Findings
CoMP outperforms existing pruning methods at high pruning ratios.
It effectively explores redundancy in both parameters and tokens.
Source code is publicly available at https://github.com/Wuzimeng/CoMP.git.
Abstract
Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode, pruning either parameters or tokens, and fail to fully explore the inherent redundancy in each mode. This leads to substantial performance degradation at high pruning ratios. To address these limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs that performs joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates distinct significance of tokens into…
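To make the two pruning modes concrete, here is a minimal sketch of what "parameter pruning" and "token pruning" each mean in isolation. It is not CoMP and does not implement the Collaborative Importance Metric, whose details are not given in the truncated abstract above. The function names, the magnitude-based weight criterion, and the externally supplied token scores are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def prune_parameters(weights: List[List[float]], ratio: float) -> List[List[float]]:
    """Zero out roughly the `ratio` fraction of weights with the smallest magnitude.

    Assumption: plain magnitude pruning, used here only as a stand-in criterion.
    Ties at the threshold are all pruned, so slightly more than `ratio` may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * ratio)
    if n_prune == 0:
        return [list(row) for row in weights]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_tokens(tokens: Sequence[str], scores: Sequence[float], ratio: float) -> List[str]:
    """Drop the `ratio` fraction of tokens with the lowest score, preserving order.

    Assumption: `scores` is a hypothetical per-token importance value supplied
    by the caller (for example, an attention-derived score).
    """
    n_keep = len(tokens) - int(len(tokens) * ratio)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore the original token order
    return [tokens[i] for i in keep]


if __name__ == "__main__":
    weights = [[0.9, -0.1], [0.05, -0.7]]
    print(prune_parameters(weights, ratio=0.5))

    tokens = ["cls", "sky", "dog", "grass"]
    scores = [0.9, 0.1, 0.8, 0.2]
    print(prune_tokens(tokens, scores, ratio=0.5))
```

A single-mode method would apply only one of these functions. A joint method, as the abstract describes, applies both, which is why the interaction between the remaining weights and the remaining tokens becomes relevant.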
