LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

TL;DR
LLaVA-PruMerge introduces an adaptive token reduction method that significantly decreases visual token count in large multimodal models, maintaining performance while improving computational efficiency.
Contribution
It proposes PruMerge, a novel dynamic token reduction strategy that leverages attention sparsity and clustering to efficiently compress visual tokens in LMMs.
Findings
Achieves 14x token reduction on LLaVA-1.5
Maintains performance in visual question-answering tasks
Cuts computational cost, which grows quadratically with the number of input tokens
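To make the quadratic relationship concrete, the following minimal sketch estimates how the pairwise self-attention cost changes when visual tokens are reduced. It assumes cost is proportional to the square of the total sequence length and ignores terms that scale linearly (projections, MLP layers). The function name and the `text_tokens` parameter are illustrative, not taken from the paper. The 2048 and 256 figures are the Video-LLaVA token counts quoted in the peer reviews.

```python
def relative_attention_cost(visual_before: int, visual_after: int, text_tokens: int = 0) -> float:
    """Return attention cost after reduction as a fraction of the cost before.

    Assumption: cost ~ (sequence length)^2, where the sequence is the
    visual tokens plus the text tokens. Linear-cost terms are ignored.
    """
    if visual_before <= 0 or visual_after <= 0 or text_tokens < 0:
        raise ValueError("token counts must be positive (text_tokens may be zero)")
    before = visual_before + text_tokens
    after = visual_after + text_tokens
    return (after / before) ** 2


# Visual tokens only: 2048 -> 256 is an 8x shorter sequence, so 64x fewer pairs.
video_ratio = relative_attention_cost(2048, 256)
print(f"visual-only attention cost: {video_ratio:.4%} of the original")

# With a hypothetical 64-token text prompt the saving is smaller,
# because the text tokens are not reduced.
with_text = relative_attention_cost(2048, 256, text_tokens=64)
print(f"with 64 text tokens: {with_text:.4%} of the original")
```

The second call shows why the end-to-end saving depends on prompt length: only the visual prefix shrinks, so longer text prompts dilute the benefit.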
Abstract
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. However, due to the inherent design of the Transformer architecture, the computational costs of these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism that identifies significant spatial redundancy among visual tokens. In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual…
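The general prune-then-merge idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it keeps a fixed number of tokens with the highest class-token attention, assigns each pruned token to its most similar kept token by cosine similarity, and merges each group with an attention-weighted average. The adaptive choice of how many tokens to keep, which the abstract describes, is not reproduced here. The names `prune_and_merge`, `cls_attention`, and `keep` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_and_merge(tokens, cls_attention, keep):
    """Reduce `tokens` to `keep` vectors.

    tokens:        list of feature vectors, one per visual token
    cls_attention: one non-negative score per token (assumed to be the
                   class token's attention to that token)
    keep:          number of tokens to retain (fixed here; an assumption)
    """
    if len(tokens) != len(cls_attention):
        raise ValueError("tokens and cls_attention must have the same length")
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")

    # Prune: keep the highest-attention tokens, preserving their original order.
    ranked = sorted(range(len(tokens)), key=lambda i: cls_attention[i], reverse=True)
    kept = sorted(ranked[:keep])
    kept_set = set(kept)

    # Cluster: attach every pruned token to its most similar kept token.
    groups = {i: [i] for i in kept}
    for j in range(len(tokens)):
        if j not in kept_set:
            nearest = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
            groups[nearest].append(j)

    # Merge: attention-weighted average of each group (plain mean if weights sum to 0).
    merged = []
    for i in kept:
        members = groups[i]
        total = sum(cls_attention[m] for m in members)
        weights = [cls_attention[m] / total if total > 0 else 1 / len(members) for m in members]
        dim = len(tokens[i])
        merged.append([sum(w * tokens[m][d] for w, m in zip(weights, members)) for d in range(dim)])
    return merged


# Toy input: two pairs of near-duplicate tokens, reduced from 4 tokens to 2.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_attention = [0.5, 0.1, 0.3, 0.1]
reduced = prune_and_merge(toy_tokens, toy_attention, keep=2)
print(len(reduced), "tokens remain")
```

Merging rather than discarding the pruned tokens lets the retained tokens absorb some information from their neighbors, which is the motivation for pairing pruning with clustering.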
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The proposed method provides a new solution for multimodal LLM efficiency in the direction of input compression. The proposed method is able to reduce computation without significant finetuning. 2. Experiments show the proposed method is effective: it maintains performance using only 5.5% of the original tokens in LLaVA-1.5 across six benchmarks. It also reduces Video-LLaVA's tokens from 2048 to 256 while preserving or improving performance. 3. The proposed method does
1. The experiments are too limited to show the generalization ability of the proposed approach. The method is primarily validated on LLaVA-1.5 and Video-LLaVA. Additional experiments on other LMMs such as Flamingo and Qwen would better show the method's generalizability across MLLM architectures with different image encoders and different feature fusion mechanisms. 2. The experiment section lacks a real-world efficiency analysis. The results on Tesla V100 GPU (e.g. in Table 2) d
1) The method effectively chooses the most informative tokens from the visual encoder without further finetuning, hence largely reducing the computation costs of MLLM. 2) PruMerge outperforms former SOTA methods on token pruning for ViTs.
Rather than an adaptive reduction method for MLLMs, PruMerge seems more like a strategy for the transformer-based visual encoder. It does not explore the design of the LLM and, in particular, ignores the interaction between visual tokens and text tokens in the MLLM. If so, the authors may consider evaluating PruMerge on more visual encoders, e.g., SigLIP, which is widely used in MLLMs. Also, as the authors claim that token relevance matters more for the MLLM encoder, the paper may support it with more exp
- High Visual Token Pruning Rate: The proposed method achieves a significant reduction in computational cost for Multimodal Large Language Models (MLLMs) by pruning a high proportion of visual tokens. This makes the model more efficient in resource-constrained settings. - Inference Speed and Memory Efficiency: By pruning visual tokens before they are fed into the LLM, the approach reduces both inference time and memory usage. This aspect is particularly advantageous for practical applications wh
- Potential Loss of Auxiliary Visual Information: The aggressive pruning of visual tokens could lead to a significant loss of auxiliary information, which is not evident from experiments on the VQA dataset alone. This likely contributes to weaker performance on the POPE dataset. Given the request-agnostic nature of the pruning method, it may only preserve the primary content, limiting its broader applicability. Testing the method on tasks requiring richer image information, like image captioning, w
1. Enhanced Computational Efficiency: PruMerge reduces visual tokens, decreasing computational complexity and speeding up large multimodal models, making them more suitable for resource-constrained environments. 2. Adaptive Selection Mechanism: By leveraging sparse attention, the method retains essential visual information while reducing computation, allowing for flexibility based on the complexity of input images.
1. A key limitation of this method is that, by trimming visual tokens, it risks losing essential image details. This makes it less suited for tasks that rely heavily on fine-grained visual information, like OCR, object detection, or fine-grained classification. Since these tasks require precise visual cues to perform well, reducing tokens could hurt performance. Another issue is that the paper doesn't test on these detail-sensitive tasks, leaving some doubt about the method's broader applicabilit
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Softmax · Dropout
