RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs
Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, Lianwen Jin

TL;DR
RedundancyLens identifies and exploits redundancy in how decoder-only multimodal large language models process visual tokens, enabling more efficient inference without retraining while maintaining performance.
Contribution
The paper introduces a training-free framework to analyze and reduce visual token processing redundancy in decoder-only MLLMs, improving efficiency while preserving performance.
Findings
Decoder-only MLLMs exhibit significant structured redundancy in visual token processing.
This redundancy can be reduced without retraining while maintaining or even improving performance.
The framework accelerates inference, making decoder-only MLLMs more efficient.
Abstract
Current Multimodal Large Language Model (MLLM) architectures face a critical tradeoff between performance and efficiency: decoder-only architectures achieve higher performance but lower efficiency, while cross-attention-based architectures offer greater efficiency but lower performance. The key distinction lies in how visual tokens are processed. Decoder-only architectures apply self-attention and FFN operations on visual tokens, while cross-attention architectures skip these computations. To investigate whether redundancy exists in this computationally expensive process, we propose a training-free framework for analyzing trained MLLMs. It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens, as well as a Layer Ranking Algorithm that prioritizes layers for these reductions. Extensive experiments demonstrate…
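The excerpt above does not spell out how Hollow Attention works, so the following is only an illustrative sketch of one way attention computation for visual tokens could be reduced: visual-to-visual attention is restricted to a local window, while any pair involving a text token keeps ordinary causal attention. The function names, the `window` parameter, and the token layout are hypothetical and not taken from the paper.

```python
def sparse_visual_mask(is_visual, window):
    """Build a causal boolean mask; mask[q][k] is True if query q may attend to key k.

    Assumption (not from the paper excerpt): a visual query attends to a visual
    key only when the key is fewer than `window` positions behind it. Any pair
    involving a text token keeps ordinary causal attention.
    """
    n = len(is_visual)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only keys at or before the query
            both_visual = is_visual[q] and is_visual[k]
            if not both_visual or q - k < window:
                mask[q][k] = True
    return mask


def count_allowed(mask):
    """Count the query-key pairs that the mask keeps."""
    return sum(sum(row) for row in mask)


# Hypothetical layout: 6 visual tokens followed by 2 text tokens.
layout = [True] * 6 + [False] * 2
dense = sparse_visual_mask(layout, window=len(layout))  # window spans everything
local = sparse_visual_mask(layout, window=2)
print(count_allowed(dense), count_allowed(local))  # prints "36 26"
```

In this toy layout the text queries still see every earlier token, so the saving comes entirely from the visual-to-visual block (21 pairs reduced to 11). With realistic sequences, where visual tokens typically far outnumber text tokens, that block would dominate the attention cost.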
Taxonomy
Topics: Advanced Data Compression Techniques · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
Methods: Softmax · Attention Is All You Need
