Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration
Yuhang Han, Xuyang Liu, Zihan Zhang, Pengxiang Ding, Junjie Chen, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang

TL;DR
This paper introduces a training-free token reduction framework for Multimodal Large Language Models that significantly accelerates processing by filtering and recycling visual tokens, achieving high efficiency with minimal performance loss.
Contribution
The paper presents the novel FiCoCo framework, combining redundancy-based token filtering and correlation-based information recycling, to accelerate MLLMs without retraining.
Findings
Up to 14.7x FLOPs reduction with 93.6% performance retention
Outperforms state-of-the-art training-free methods
Effective across various model architectures and tasks
Abstract
The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment. In this paper, we devise a "filter-correlate-compress" framework to accelerate MLLMs by systematically optimizing multimodal context length during prefilling. The framework first implements FiCoCo-V, a training-free method operating within the vision encoder. It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately filter out redundant visual tokens. To mitigate information loss, the framework introduces a correlation-based information recycling mechanism that allows preserved tokens to selectively recycle information from correlated discarded tokens with a self-preserving compression, thereby preventing the dilution of their own core content.…
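The three stages named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the integrated redundancy metric or the compression rule, so the sketch substitutes mean cosine similarity to the other tokens as a hypothetical redundancy score, routes each discarded token to its single most similar preserved token, and uses a hypothetical `self_weight` parameter to keep most of each preserved token's own content. The function name `reduce_tokens` and all parameter names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_tokens(tokens, num_discard, self_weight=0.8):
    """Filter, correlate, compress over a list of token vectors.

    Assumptions (not taken from the paper): redundancy is the mean cosine
    similarity to all other tokens, and self_weight is the share of a
    preserved token's own content that survives compression.
    Returns (reduced_tokens, kept_indices).
    """
    n = len(tokens)
    if not 0 <= num_discard < n:
        raise ValueError("num_discard must satisfy 0 <= num_discard < len(tokens)")
    if num_discard == 0:
        return [list(t) for t in tokens], list(range(n))

    # Filter: score every token and mark the most redundant ones for discard.
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    redundancy = [(sum(sim[i]) - sim[i][i]) / (n - 1) for i in range(n)]
    order = sorted(range(n), key=lambda i: redundancy[i], reverse=True)
    discarded = set(order[:num_discard])
    kept = [i for i in range(n) if i not in discarded]

    # Correlate: route each discarded token to its most similar preserved token.
    recycled = {k: [] for k in kept}
    for d in sorted(discarded):
        target = max(kept, key=lambda k: sim[d][k])
        recycled[target].append(d)

    # Compress: blend recycled content in while keeping self_weight of the
    # preserved token's own content, so it is not diluted.
    reduced = []
    for k in kept:
        sources = recycled[k]
        if not sources:
            reduced.append(list(tokens[k]))
            continue
        dim = len(tokens[k])
        mean_src = [sum(tokens[d][c] for d in sources) / len(sources) for c in range(dim)]
        reduced.append(
            [self_weight * tokens[k][c] + (1.0 - self_weight) * mean_src[c] for c in range(dim)]
        )
    return reduced, kept
```

For example, `reduce_tokens([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], num_discard=1)` drops one of the two identical tokens, folds it into its twin, and leaves the distinct third token untouched. Fixing `self_weight` above 0.5 is one simple way to express the "self-preserving" idea; the paper may weight recycled information differently.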
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
