Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

Yuhang Han; Xuyang Liu; Zihan Zhang; Pengxiang Ding; Junjie Chen; Donglin Wang; Honggang Chen; Qingsen Yan; Siteng Huang

arXiv:2411.17686 · cs.CV · November 18, 2025

TL;DR

This paper introduces a training-free token reduction framework for Multimodal Large Language Models (MLLMs). It shortens the multimodal context during prefilling by filtering out redundant visual tokens and recycling their information into the tokens that are kept, which cuts computation substantially with only a small loss in performance.

Contribution

The paper presents the novel FiCoCo framework, combining redundancy-based token filtering and correlation-based information recycling, to accelerate MLLMs without retraining.

Findings

01. Up to 14.7x FLOPs reduction with 93.6% performance retention

02. Outperforms state-of-the-art training-free methods

03. Effective across various model architectures and tasks

Abstract

The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment. In this paper, we devise a "filter-correlate-compress" framework to accelerate the MLLM by systematically optimizing multimodal context length during prefilling. The framework first implements FiCoCo-V, a training-free method operating within the vision encoder. It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately filter out redundant visual tokens. To mitigate information loss, the framework introduces a correlation-based information recycling mechanism that allows preserved tokens to selectively recycle information from correlated discarded tokens with self-preserving compression, thereby preventing the dilution of their own core content.…
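The three stages named in the abstract can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the abstract does not define the integrated redundancy metric or the exact compression rule, so the sketch substitutes mean cosine similarity as a stand-in redundancy score and a fixed-weight average as a stand-in for self-preserving compression. The function name `filter_correlate_compress` and the parameters `num_discard` and `self_weight` are hypothetical.

```python
import numpy as np

def filter_correlate_compress(tokens, num_discard, self_weight=0.7):
    """Toy token reduction. tokens: (N, D). Returns (reduced tokens, kept indices)."""
    n = tokens.shape[0]
    if not 0 < num_discard < n:
        raise ValueError("num_discard must satisfy 0 < num_discard < N")

    # Pairwise cosine similarity, with self-similarity zeroed out.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)

    # Filter (assumed metric): a token that resembles many others is redundant.
    redundancy = sim.sum(axis=1) / (n - 1)
    order = np.argsort(-redundancy, kind="stable")
    discard = np.sort(order[:num_discard])
    keep = np.sort(order[num_discard:])

    # Correlate: each discarded token selects its most similar kept token.
    target = sim[np.ix_(discard, keep)].argmax(axis=1)

    # Compress (assumed rule): the kept token retains `self_weight` of itself,
    # so recycled information cannot dilute its own content beyond that share.
    out = tokens[keep].astype(float).copy()
    for j in range(len(keep)):
        assigned = discard[target == j]
        if assigned.size:
            recycled = tokens[assigned].mean(axis=0)
            out[j] = self_weight * out[j] + (1.0 - self_weight) * recycled
    return out, keep
```

Calling `filter_correlate_compress(x, num_discard=6)` on a `(16, 8)` array returns a `(10, 8)` array plus the indices of the kept tokens. Setting `self_weight=1.0` reduces the sketch to plain token pruning, which makes the effect of the recycling step easy to isolate.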

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings