RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs

Hongliang Li; Jiaxin Zhang; Wenhui Liao; Dezhi Peng; Kai Ding; Lianwen Jin

arXiv:2501.19036 · cs.CV · June 2, 2025

TL;DR

RedundancyLens is a training-free framework that identifies and exploits redundancy in how decoder-only multimodal large language models process visual tokens, reducing inference cost without retraining while preserving performance.

Contribution

The paper introduces a training-free framework to analyze and reduce visual token processing redundancy in decoder-only MLLMs, improving efficiency while preserving performance.

Findings

01

Decoder-only MLLMs exhibit significant structured redundancy in visual token processing.

02

Computation on visual tokens can be reduced without retraining while maintaining, or even improving, performance.

03

The framework accelerates inference, making decoder-only MLLMs more efficient.

Abstract

Current Multimodal Large Language Model (MLLM) architectures face a critical tradeoff between performance and efficiency: decoder-only architectures achieve higher performance but lower efficiency, while cross-attention-based architectures offer greater efficiency but lower performance. The key distinction lies in how visual tokens are processed. Decoder-only architectures apply self-attention and FFN operations on visual tokens, while cross-attention architectures skip these computations. To investigate whether redundancy exists in this computationally expensive process, we propose a training-free framework for analyzing trained MLLMs. It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens, as well as a Layer Ranking Algorithm that prioritizes layers for these reductions. Extensive experiments demonstrate…
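
The abstract does not spell out how the reduced attention over visual tokens is constructed, so the following is only an illustrative sketch and not the paper's definition of Hollow Attention. It assumes one plausible form of reduction: visual tokens attend to other visual tokens only within a short local window, while every pair that involves a text token is left unchanged. The names `build_mask`, `count_pairs`, and the `window` parameter are hypothetical.

```python
def build_mask(is_visual, window=None):
    """Return a causal attention mask as a list of lists of bools.

    is_visual[i] is True when position i holds a visual token.
    window=None gives the ordinary causal mask. An integer limits
    visual-to-visual attention to the `window` most recent positions
    (the query itself included). Pairs involving a text token are untouched.
    This windowing rule is an assumption made for illustration.
    """
    n = len(is_visual)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys at or before the query
            both_visual = is_visual[q] and is_visual[k]
            if window is not None and both_visual and q - k >= window:
                continue  # drop long-range visual-to-visual pairs
            mask[q][k] = True
    return mask


def count_pairs(mask):
    """Count the query-key pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Toy sequence: 6 visual tokens followed by 2 text tokens.
is_visual = [True] * 6 + [False] * 2
full = build_mask(is_visual)
sparse = build_mask(is_visual, window=2)
print(count_pairs(full), count_pairs(sparse))
```

On this toy sequence the windowed mask keeps every text-to-visual pair and removes only long-range visual-to-visual pairs, which is where the attention savings would come from under this assumption. Real MLLM inputs contain far more visual than text tokens, so the proportion of removed pairs would be larger than in this small example.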

Taxonomy

Topics: Advanced Data Compression Techniques · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems

Methods: Softmax · Attention Is All You Need