Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Omer Faruk Deniz; Ruiyu Mao; Ruochen Li; Yapeng Tian; Latifur Khan

arXiv:2602.12618·cs.CV·February 16, 2026

Open Access

TL;DR

This paper introduces Attention-Driven Self-Compression (ADSC), a method that reduces vision tokens inside multimodal large language models using only the LLM's own attention mechanism. On LLaVA-1.5 it cuts FLOPs by 53.7% and memory by 56.7% while retaining 98.2% of the original performance.

Contribution

The paper proposes a new token compression technique that uses the LLM's attention to guide progressive vision token reduction without auxiliary modules or attention modifications.

Findings

01. Reduces FLOPs by 53.7% and memory by 56.7% on LLaVA-1.5

02. Maintains 98.2% of original performance after compression

03. Outperforms prior pruning methods in efficiency and accuracy

Abstract

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, which limits generality because encoder-projector designs vary, or within the LLM, using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It…
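The uniform downsampling at selected layers described in the abstract can be illustrated with a small index-level sketch. This is not the authors' implementation: the function names, the layer indices in `schedule`, and the stride of 2 are hypothetical choices for illustration, since the excerpt does not state which layers or ratios the paper uses. The sketch only tracks which sequence positions enter each layer; in a real model the hidden states would be gathered at these positions.

```python
def uniform_downsample(vision_positions, stride):
    """Keep every `stride`-th vision token position (content-agnostic selection)."""
    return vision_positions[::stride]


def positions_per_layer(num_layers, seq_len, vision_start, vision_len, schedule):
    """Return, for each layer, the sequence positions that are fed into it.

    `schedule` maps a layer index to the stride applied to the surviving
    vision tokens just before that layer (hypothetical values, see lead-in).
    Text tokens before and after the image are never dropped.
    """
    text_before = list(range(vision_start))
    vision = list(range(vision_start, vision_start + vision_len))
    text_after = list(range(vision_start + vision_len, seq_len))

    per_layer = []
    for layer in range(num_layers):
        if layer in schedule:
            # Bottleneck: fewer vision tokens continue into this and later layers.
            vision = uniform_downsample(vision, schedule[layer])
        per_layer.append(text_before + vision + text_after)
    return per_layer


# Toy setup: 4 text tokens, 16 vision tokens, 4 text tokens, 8 layers.
# Assumed schedule: halve the vision tokens before layers 2, 4, and 6.
layers = positions_per_layer(
    num_layers=8, seq_len=24, vision_start=4, vision_len=16,
    schedule={2: 2, 4: 2, 6: 2},
)
lengths = [len(p) for p in layers]
print(lengths)
```

Because the selection here depends only on position, not on attention scores, nothing in this step requires reading the attention matrix; any reorganization of information into the surviving tokens is left to the model's ordinary attention in the layers before each bottleneck.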

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis