GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang

TL;DR
GridPrune introduces a two-stage visual token pruning method inspired by human attention, improving efficiency and performance in multimodal large language models by dynamically allocating tokens across spatial zones.
Contribution
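GridPrune proposes a zonal selection approach that first decides "where to look" (how to distribute the token budget across spatial zones) and only then "what to select" (which tokens to keep within each zone), and it reports better results than existing visual token pruning methods for MLLMs.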
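The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the excerpt here does not say how zones are scored or how the budget is split, so the proportional allocation rule, the per-token `scores` input, and the function names `allocate_budget` and `zonal_prune` are all assumptions made for illustration.

```python
from typing import List


def allocate_budget(zone_scores: List[float], zone_sizes: List[int], budget: int) -> List[int]:
    """Stage 1 ("where to look"): split a token budget across zones.

    Assumption (not from the paper): each zone's share is proportional to
    its summed score, rounded by largest remainder and capped at zone size.
    """
    budget = min(budget, sum(zone_sizes))  # cannot keep more tokens than exist
    total = sum(zone_scores)
    if total <= 0:
        shares = [budget / len(zone_scores)] * len(zone_scores)
    else:
        shares = [budget * s / total for s in zone_scores]
    alloc = [min(int(share), size) for share, size in zip(shares, zone_sizes)]
    # Hand out leftover tokens by largest fractional remainder, skipping full zones.
    order = sorted(range(len(shares)), key=lambda z: shares[z] - int(shares[z]), reverse=True)
    remaining = budget - sum(alloc)
    while remaining > 0:
        progressed = False
        for z in order:
            if remaining == 0:
                break
            if alloc[z] < zone_sizes[z]:
                alloc[z] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every zone is already full
    return alloc


def zonal_prune(scores: List[float], height: int, width: int,
                zone_rows: int, zone_cols: int, budget: int) -> List[int]:
    """Stage 2 ("what to select"): keep the top-scoring tokens inside each zone.

    `scores` holds one hypothetical importance value per visual token,
    laid out row-major on a height x width grid. Returns kept flat indices.
    """
    assert len(scores) == height * width
    assert height % zone_rows == 0 and width % zone_cols == 0
    zh, zw = height // zone_rows, width // zone_cols
    zones: List[List[int]] = [[] for _ in range(zone_rows * zone_cols)]
    for idx in range(height * width):
        r, c = divmod(idx, width)
        zones[(r // zh) * zone_cols + (c // zw)].append(idx)
    zone_scores = [sum(scores[i] for i in members) for members in zones]
    alloc = allocate_budget(zone_scores, [len(m) for m in zones], budget)
    kept: List[int] = []
    for members, k in zip(zones, alloc):
        ranked = sorted(members, key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:k])
    return sorted(kept)


if __name__ == "__main__":
    # Toy 4x4 grid split into 2x2 zones; the top-left zone carries most of the score.
    toy_scores = [9, 1, 0, 0,
                  8, 7, 0, 1,
                  0, 0, 0, 0,
                  0, 2, 0, 0]
    print(zonal_prune(toy_scores, 4, 4, 2, 2, budget=8))
```

Capping each zone at its own size and redistributing the leftover budget means a dominant zone cannot absorb every retained token, which is one plausible way to preserve some spatial coverage. Whether GridPrune uses a rule of this kind is not stated in the text above.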
Findings
Retains 96.98% of full-model performance while using only 11.1% of the visual tokens.
Outperforms the baseline by 2.34% at the same pruning rate.
Achieves superior efficiency and accuracy across various MLLM architectures.
Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
