[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

Ao Wang; Fengyuan Sun; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding

arXiv:2412.05819 · cs.CV · December 10, 2024

TL;DR

This paper introduces VTC-CLS, a training-free visual token compression method for multimodal large language models that uses the [CLS] token's attention scores to efficiently identify important visual tokens, improving performance and reducing computational costs.

Contribution

The paper proposes a novel, training-free visual token pruning method leveraging the [CLS] token's attention scores, outperforming existing methods in efficiency and accuracy.

Findings

01

VTC-CLS achieves state-of-the-art performance on various tasks.

02

It significantly reduces computational costs without training.

03

The method effectively captures key visual information using [CLS] attention scores.
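The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single attention head, and the function name, `cls_query`, `patch_keys`, and `keep` are hypothetical. This summary does not say which encoder layer or heads the paper uses, so the sketch only shows the general idea of ranking visual tokens by the attention they receive from [CLS] and keeping the top few.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens_by_cls_attention(cls_query, patch_keys, keep):
    """Rank visual (patch) tokens by [CLS] attention and keep the top `keep`.

    cls_query  : query vector of the [CLS] token (hypothetical input)
    patch_keys : one key vector per visual token (hypothetical input)
    keep       : number of visual tokens to retain

    Returns the retained indices in their original order, plus the
    full attention distribution for inspection.
    """
    dim = len(cls_query)
    # Scaled dot-product attention logits from [CLS] to every patch token.
    logits = [
        sum(q * k for q, k in zip(cls_query, key)) / math.sqrt(dim)
        for key in patch_keys
    ]
    attn = softmax(logits)
    # Highest attention first, then restore spatial order for the kept tokens.
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, attn


# Toy example: four visual tokens with 2-dimensional keys.
cls_query = [1.0, 0.0]
patch_keys = [[0.1, 0.9], [2.0, 0.0], [0.5, 0.5], [1.5, -1.0]]
kept, attn = select_tokens_by_cls_attention(cls_query, patch_keys, keep=2)
print(kept)  # indices of the two tokens [CLS] attends to most
```

Because the scores come from the visual encoder alone, a selection of this kind needs no training and does not depend on the prompt text; only the retained tokens would then be passed on to the language model.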

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in the computer vision community. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. Recognizing the redundancy of information within the vision modality, recent studies have explored methods for compressing visual tokens in MLLMs to enhance efficiency in a training-free manner. Despite their effectiveness, existing methods like FastV rely on the attention between visual tokens and prompt text tokens as the importance indicator, overlooking the relevance to the response text and thus introducing perception bias. In this paper, we demonstrate that in MLLMs, the [CLS] token in the visual encoder inherently knows which visual tokens are important for MLLMs. Building on…

Code & Models

Repositories

thu-mig/vtc-cls (PyTorch, official)

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Pruning