Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan

TL;DR
This paper introduces Attention-Driven Self-Compression (ADSC), a method that reduces vision tokens in multimodal large language models using only the LLM's own attention, cutting FLOPs by 53.7% and memory by 56.7% on LLaVA-1.5 while retaining 98.2% of the original performance.
Contribution
The paper proposes a new token compression technique that uses the LLM's attention to guide progressive vision token reduction without auxiliary modules or attention modifications.
Findings
Reduces FLOPs by 53.7% and memory by 56.7% on LLaVA-1.5
Maintains 98.2% of original performance after compression
Outperforms prior pruning methods in efficiency and accuracy
Abstract
Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, which limits generality because encoder-projector designs vary, or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It…
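The uniform downsampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stride of 2, the choice of layers 8, 16, and 24, and the 32-layer depth are all assumptions made for illustration, since the abstract does not state them. The sketch only tracks which token positions survive; a real model would run a transformer layer at each step, and it is the model's unmodified attention that moves information into the kept tokens. The count of 576 vision tokens matches the LLaVA-1.5 input grid of 24 x 24 patches.

```python
def uniform_keep_indices(num_vision_tokens, stride):
    """Content-agnostic selection: keep every `stride`-th vision token."""
    return list(range(0, num_vision_tokens, stride))


def downsample_vision_tokens(hidden, vision_start, vision_len, stride):
    """Drop vision tokens uniformly and leave text tokens untouched.

    `hidden` is a list with one entry per token. Returns the shortened
    sequence and the new number of vision tokens.
    """
    keep = uniform_keep_indices(vision_len, stride)
    vision = [hidden[vision_start + i] for i in keep]
    prefix = hidden[:vision_start]
    suffix = hidden[vision_start + vision_len:]
    return prefix + vision + suffix, len(vision)


def run_layers(hidden, vision_start, vision_len, num_layers, schedule):
    """Walk through the layers, downsampling where `schedule` says so.

    `schedule` maps a layer index to a stride (hypothetical values).
    Returns the final sequence and the vision token count seen by each layer.
    """
    counts = []
    for layer in range(num_layers):
        if layer in schedule:
            hidden, vision_len = downsample_vision_tokens(
                hidden, vision_start, vision_len, schedule[layer]
            )
        # A real model would apply transformer layer `layer` to `hidden` here.
        counts.append(vision_len)
    return hidden, counts


# Toy sequence: 5 text tokens, 576 vision tokens, 10 text tokens.
tokens = ["t"] * 5 + [f"v{i}" for i in range(576)] + ["q"] * 10
schedule = {8: 2, 16: 2, 24: 2}  # assumed layers and strides, not from the paper
final, counts = run_layers(tokens, 5, 576, 32, schedule)
print(counts[0], counts[8], counts[16], counts[24], len(final))
```

Under these assumed settings the later layers process a quarter or less of the original vision tokens, which is where the compute and memory savings of such a scheme would come from. Because tokens are dropped by position rather than by attention score, nothing in the sketch needs access to attention weights, which is consistent with the abstract's point about avoiding attention modifications.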
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis
