[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

TL;DR
This paper introduces VTC-CLS, a training-free visual token compression method for multimodal large language models that uses the [CLS] token's attention scores to efficiently identify important visual tokens, improving performance and reducing computational costs.
Contribution
The paper proposes a novel, training-free visual token pruning method leveraging the [CLS] token's attention scores, outperforming existing methods in efficiency and accuracy.
Findings
VTC-CLS achieves state-of-the-art performance on various tasks.
It significantly reduces computational costs without training.
The method effectively captures key visual information using [CLS] attention scores.
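The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the [CLS]-to-patch attention weights of one visual-encoder layer are already available as per-head lists, that heads are combined by a plain average, and that a fixed token budget `keep` is used. The names `select_visual_tokens` and `prune` are hypothetical.

```python
from typing import List, Sequence


def select_visual_tokens(cls_attention: Sequence[Sequence[float]], keep: int) -> List[int]:
    """Return indices of the `keep` patch tokens that receive the most [CLS] attention.

    cls_attention[h][i] is the attention weight from the [CLS] query to patch
    token i in head h (assumed layout; the [CLS] position itself is excluded).
    """
    if not cls_attention:
        raise ValueError("cls_attention must contain at least one head")
    num_heads = len(cls_attention)
    num_tokens = len(cls_attention[0])
    if any(len(head) != num_tokens for head in cls_attention):
        raise ValueError("all heads must cover the same number of patch tokens")
    if not 0 < keep <= num_tokens:
        raise ValueError("keep must be between 1 and the number of patch tokens")

    # Assumption: heads are combined by a simple mean to get one score per token.
    scores = [
        sum(head[i] for head in cls_attention) / num_heads
        for i in range(num_tokens)
    ]

    # Rank token indices by score, highest first, and keep the top `keep`.
    ranked = sorted(range(num_tokens), key=lambda i: scores[i], reverse=True)

    # Restore the original order so the kept tokens stay in spatial sequence.
    return sorted(ranked[:keep])


def prune(tokens: Sequence, kept_indices: Sequence[int]) -> list:
    """Keep only the visual tokens at `kept_indices` before they reach the LLM."""
    return [tokens[i] for i in kept_indices]


# Toy example: 2 heads, 6 patch tokens, budget of 3 tokens.
head0 = [0.05, 0.40, 0.02, 0.30, 0.13, 0.10]
head1 = [0.10, 0.20, 0.05, 0.50, 0.10, 0.05]
kept = select_visual_tokens([head0, head1], keep=3)
print(kept)                         # [1, 3, 4]
print(prune(list("abcdef"), kept))  # ['b', 'd', 'e']
```

Because the scores come from the visual encoder alone, this kind of ranking needs neither the text prompt nor any training, which is the property the summary emphasizes.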
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in the computer vision community. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. Recognizing the redundancy of information within the vision modality, recent studies have explored methods for compressing visual tokens in MLLMs to enhance efficiency in a training-free manner. Despite their effectiveness, existing methods like FastV rely on the attention between visual tokens and prompt text tokens as the importance indicator, overlooking the relevance to response text and thus introducing perception bias. In this paper, we demonstrate that in MLLMs, the [CLS] token in the visual encoder inherently knows which visual tokens are important for MLLMs. Building on…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Pruning
