D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan, Shouhong Ding, Biqing Qi, Linfeng Zhang

TL;DR
D2Pruner is a token pruning framework for Multimodal Large Language Models (MLLMs) that combines debiased importance scoring with structural diversity to cut computational cost while largely preserving performance, especially on fine-grained localization tasks.
Contribution
It pairs debiased importance scoring with a structural pruning mechanism based on a Maximal Independent Set (MIS) to improve token selection.
Findings
Reduces FLOPs by 74.2% with 99.2% performance retention on general tasks.
Maintains 85.7% performance at 90% token reduction on localization benchmarks.
Outperforms existing methods with up to 63.53% improvement.
Abstract
Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention…
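The two-stage idea described in the abstract (first fix a set of pivot tokens by debiased importance, then prune the remainder structurally) can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated here, so the debiasing rule (subtracting a per-position prior `pos_prior`), the cosine-similarity graph, the threshold `sim_threshold`, and the function name `select_tokens` are all assumptions made for illustration. The only elements taken from the text are the use of pivots chosen by a debiased score and a Maximal Independent Set over the remaining tokens.

```python
import numpy as np

def select_tokens(feats, attn, pos_prior, n_pivots, sim_threshold=0.9):
    """Hypothetical two-stage token selection (illustrative only).

    feats:     (N, D) visual token features
    attn:      (N,)   raw importance, e.g. attention received from the prompt
    pos_prior: (N,)   assumed per-position bias to subtract from attn
    """
    # Stage 1 (assumed form of debiasing): remove the positional prior,
    # then take the top-scoring tokens as pivots.
    scores = attn - pos_prior
    order = np.argsort(-scores, kind="stable")
    pivots = [int(i) for i in order[:n_pivots]]

    # Assumed redundancy graph: an edge joins two tokens whose cosine
    # similarity exceeds sim_threshold.
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    adj = (unit @ unit.T) > sim_threshold
    np.fill_diagonal(adj, False)

    # Pivots are always kept; tokens redundant with a pivot are blocked.
    blocked = np.zeros(len(scores), dtype=bool)
    for p in pivots:
        blocked[p] = True
        blocked |= adj[p]

    # Stage 2: greedy maximal independent set over the unblocked tokens,
    # visited in order of decreasing debiased score.
    extra = []
    for i in order[n_pivots:]:
        if not blocked[i]:
            extra.append(int(i))
            blocked[i] = True
            blocked |= adj[i]
    return sorted(pivots + extra)

# Toy data: tokens 0/1 and 2/3 are near-duplicates, token 4 is distinct.
feats = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0], [1.0, 1.0]])
attn = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
pos_prior = np.array([0.5, 0.0, 0.0, 0.0, 0.0])  # token 0 is inflated by position

print(select_tokens(feats, attn, pos_prior, n_pivots=1))         # [1, 2, 4]
print(select_tokens(feats, attn, np.zeros(5), n_pivots=1))       # [0, 2, 4]
```

In the toy data, subtracting the positional prior changes which of the two near-duplicate tokens becomes the pivot, and the independent-set pass keeps one representative per cluster of similar tokens. The sketch imposes no explicit token budget; a practical variant would presumably tune the threshold or cap the number of kept tokens to hit a target reduction ratio.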
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
