TL;DR
This paper explores structured pruning methods for large vision-language models, demonstrating effective compression with minimal performance loss using limited data and lightweight recovery techniques.
Contribution
It investigates layerwise and widthwise structured pruning of the language model backbone, paired with supervised finetuning and knowledge distillation for recovery, and offers practical strategies for efficient LVLM compression.
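To make the two pruning paradigms concrete, here is a minimal sketch on a toy stack of MLP blocks stored as plain Python lists. It is not the paper's implementation: the `prune_layerwise` and `prune_widthwise` helpers, the toy weight shapes, and the L2-norm importance score are all assumptions chosen for illustration.

```python
import math

def make_layer(scale, hidden=4, d_model=2):
    """Toy MLP block: w_in is [hidden][d_model], w_out is [d_model][hidden]."""
    return {
        "w_in": [[scale * (j + 1)] * d_model for j in range(hidden)],
        "w_out": [[1.0] * hidden for _ in range(d_model)],
    }

def count_params(layers):
    """Total number of weights across all blocks."""
    return sum(
        len(l["w_in"]) * len(l["w_in"][0]) + len(l["w_out"]) * len(l["w_out"][0])
        for l in layers
    )

def prune_layerwise(layers, drop_indices):
    """Layerwise pruning: remove whole blocks; surviving blocks keep their shapes."""
    drop = set(drop_indices)
    return [layer for i, layer in enumerate(layers) if i not in drop]

def prune_widthwise(layer, keep_ratio):
    """Widthwise pruning: keep every block but shrink its hidden dimension."""
    w_in, w_out = layer["w_in"], layer["w_out"]
    hidden = len(w_in)
    n_keep = max(1, round(hidden * keep_ratio))
    # Assumed importance score: L2 norm of each hidden unit's input weights.
    scores = [math.sqrt(sum(x * x for x in row)) for row in w_in]
    ranked = sorted(range(hidden), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:n_keep])
    return {
        # Drop the pruned units' rows in w_in and matching columns in w_out.
        "w_in": [w_in[j] for j in keep],
        "w_out": [[row[j] for j in keep] for row in w_out],
    }

layers = [make_layer(s) for s in (1.0, 2.0, 3.0, 4.0)]
depth_pruned = prune_layerwise(layers, drop_indices=[2, 3])
width_pruned = [prune_widthwise(l, keep_ratio=0.5) for l in layers]

print(count_params(layers), count_params(depth_pruned), count_params(width_pruned))
```

In this toy setup both variants remove half of the weights, but the layerwise variant ends up with fewer blocks while the widthwise variant keeps the depth and narrows each block.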
Findings
Widthwise pruning outperforms layerwise pruning in low-resource scenarios.
Finetuning only the multimodal projector suffices at small compression levels.
Effective recovery is achieved with just 5% of the original data.
Abstract
While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise…
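The abstract mentions knowledge distillation on logits and hidden states. One common way to combine the two signals is sketched below, assuming a KL term on temperature-softened logits plus a mean-squared-error term on hidden states. The `distillation_loss` function, the `temperature`, `alpha`, and `beta` parameters, and the requirement that student and teacher hidden vectors have equal length are illustrative assumptions, not details confirmed by the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=1.0, beta=1.0):
    """Assumed combination: alpha * KL(teacher || student) + beta * hidden-state MSE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    logit_term = kl_divergence(p_teacher, p_student)
    hidden_term = mse(student_hidden, teacher_hidden)
    return alpha * logit_term + beta * hidden_term

teacher_logits = [2.0, 0.5, -1.0]
teacher_hidden = [0.1, -0.3, 0.7, 0.2]

# A student that matches the teacher exactly incurs no loss.
matched = distillation_loss(teacher_logits, teacher_logits,
                            teacher_hidden, teacher_hidden)

# A student that deviates in both logits and hidden states is penalized.
mismatched = distillation_loss([0.0, 1.0, 0.0], teacher_logits,
                               [0.0, 0.0, 0.0, 0.0], teacher_hidden)
print(matched, mismatched)
```

If pruning changes the student's hidden size, the hidden-state term would need an additional projection to align the dimensions, which this sketch leaves out.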
