TL;DR
CLASP is a flexible token reduction framework for multimodal large language models that uses class-adaptive layer fusion and dual-stage pruning to improve efficiency and robustness.
Contribution
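It introduces a class-adaptive pruning method that fuses visual features from multiple ViT layers and selects tokens in two stages: attention-salient pivot tokens for relevance, then redundancy-aware completion tokens for coverage. The authors report that it outperforms existing approaches.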
Findings
The authors report that CLASP outperforms existing token reduction approaches across the benchmarks they evaluate.
It effectively reduces visual tokens while maintaining accuracy.
The method is robust under diverse instructions and architectures.
Abstract
Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget…
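The two steps named in the abstract, multi-layer feature fusion followed by dual-stage pruning, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fuse_layers`, `dual_stage_prune`), the `pivot_ratio` parameter, the use of a weighted sum for fusion, and the use of cosine similarity with greedy farthest-first selection for the redundancy-aware stage are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def fuse_layers(layer_feats, layer_weights):
    """Weighted average of per-layer token features (assumed fusion rule).

    layer_feats[l][t] is the feature vector of token t at layer l.
    layer_weights are hypothetical per-class weights, normalized here to sum to 1.
    """
    total = sum(layer_weights)
    n_tokens, dim = len(layer_feats[0]), len(layer_feats[0][0])
    fused = [[0.0] * dim for _ in range(n_tokens)]
    for feats, w in zip(layer_feats, layer_weights):
        for t in range(n_tokens):
            for d in range(dim):
                fused[t][d] += (w / total) * feats[t][d]
    return fused


def dual_stage_prune(features, attn_scores, budget, pivot_ratio=0.5):
    """Return sorted indices of the tokens kept under `budget`.

    Stage 1 keeps the highest-attention tokens (pivots).
    Stage 2 greedily adds the token least similar to anything already kept
    (completions), so the kept set covers regions the pivots miss.
    """
    n = len(features)
    budget = max(0, min(budget, n))
    if budget == 0:
        return []
    n_pivot = min(budget, max(1, int(budget * pivot_ratio)))

    # Stage 1: attention-salient pivot tokens.
    order = sorted(range(n), key=lambda i: -attn_scores[i])
    selected = order[:n_pivot]
    remaining = set(order[n_pivot:])

    # Stage 2: redundancy-aware completion tokens.
    while len(selected) < budget:
        nxt = min(
            sorted(remaining),
            key=lambda i: max(cosine(features[i], features[j]) for j in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return sorted(selected)


# Toy example: tokens 0, 1, and 3 are near-duplicates; token 2 is distinct
# but receives little attention.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.9, 0.1]]
attn_scores = [0.9, 0.8, 0.1, 0.5]
kept = dual_stage_prune(features, attn_scores, budget=2)
```

In this toy case a pure top-2 attention rule would keep the near-duplicate tokens 0 and 1, while the sketch keeps tokens 0 and 2. A class-adaptive variant would choose `layer_weights` and `pivot_ratio` per prompt category; how CLASP derives those settings is not specified in the excerpt above.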
