Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More

Zichen Wen; Yifeng Gao; Shaobo Wang; Junyuan Zhang; Qintong Zhang; Weijia Li; Conghui He; Linfeng Zhang

arXiv:2502.11494 · cs.CL · June 10, 2025

TL;DR

This paper introduces DART, a token pruning method for multimodal language models that removes duplicated vision tokens rather than supposedly unimportant ones. It prunes 88.9% of vision tokens with comparable performance, giving a 1.99× speed-up in total time and a 2.99× speed-up in the prefill stage.

Contribution

DART is a duplication-aware token pruning method that outperforms importance-based pruning, enabling efficient inference in multimodal models without training.

Findings

01 Prunes 88.9% of vision tokens with comparable performance

02 Achieves 1.99× speed-up in total time

03 Achieves 2.99× speed-up in prefill stage
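
To make the 88.9% pruning ratio concrete, the short sketch below converts a ratio into a retained-token budget. The starting count of 576 vision tokens is an assumption for illustration (a 24 × 24 patch grid) and is not stated on this page; `kept_tokens` is a hypothetical helper name.

```python
def kept_tokens(total_tokens: int, prune_ratio: float) -> int:
    """Number of vision tokens that survive pruning at the given ratio."""
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be in [0, 1]")
    return round(total_tokens * (1.0 - prune_ratio))


# Assumed input size: 576 vision tokens (not taken from this page).
print(kept_tokens(576, 0.889))  # prints 64
```

Under that assumption, pruning 88.9% of the vision tokens leaves a budget of 64 tokens.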

Abstract

Vision tokens in multimodal large language models often account for most of the computational overhead because they are far more numerous than the language tokens. Many recent methods address this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually results in worse performance than random token pruning and is incompatible with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with…
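
The abstract is cut off before it states the retention rule, so the following is only a minimal sketch of duplication-based pruning under stated assumptions, not the authors' implementation. It assumes pivots are chosen uniformly at random, that duplication is measured as the maximum cosine similarity to any pivot, and that the least-duplicated tokens are kept alongside the pivots. The function name `duplication_prune` and its parameters are hypothetical.

```python
import numpy as np


def duplication_prune(tokens: np.ndarray, num_pivots: int, num_keep: int,
                      seed: int = 0) -> np.ndarray:
    """Return sorted indices of the tokens to keep.

    tokens: (n, d) array of vision-token features.
    Assumptions (not from the source): random pivots, cosine similarity
    as the duplication measure, least-duplicated tokens are retained.
    """
    n = tokens.shape[0]
    if not 0 < num_pivots <= num_keep <= n:
        raise ValueError("need 0 < num_pivots <= num_keep <= number of tokens")

    rng = np.random.default_rng(seed)
    pivot_idx = rng.choice(n, size=num_pivots, replace=False)

    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)

    # Duplication score: highest similarity of each token to any pivot.
    dup = (unit @ unit[pivot_idx].T).max(axis=1)
    dup[pivot_idx] = np.inf  # pivots are kept separately, not re-ranked

    # Fill the remaining budget with the least-duplicated tokens.
    order = np.argsort(dup, kind="stable")
    extra = order[: num_keep - num_pivots]
    return np.sort(np.concatenate([pivot_idx, extra]))


# Example: three identical tokens and one distinct token, budget of two.
toy = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(duplication_prune(toy, num_pivots=1, num_keep=2))
```

In the toy example the distinct token (index 3) is always retained, whichever pivot is drawn, because it is either the pivot itself or the token least similar to the pivot. The selection reads only token features, not attention scores.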

Code & Models

Repositories

zichenwen1/dart (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · DART (Duplication-Aware Reduction of Tokens) · Pruning