METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Yuchen Liu; Yaoming Wang; Bowen Shi; Xiaopeng Zhang; Wenrui Dai; Chenglin Li; Hongkai Xiong; Qi Tian

arXiv:2507.20842·cs.CV·July 29, 2025

TL;DR

METEOR introduces a multi-stage token pruning framework for multi-encoder vision-language models, significantly reducing computational load while maintaining high performance across diverse benchmarks.

Contribution

METEOR is presented as the first multi-stage token pruning approach for multi-encoder vision-language models, improving efficiency with only a marginal loss in accuracy.

Findings

01

Reduces visual tokens by 76% with only a 0.3% performance drop

02

Achieves state-of-the-art efficiency on 11 benchmarks

03

Demonstrates effectiveness of collaborative token pruning strategies

Abstract

Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative…
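To make the encoding-stage idea more concrete, here is a minimal sketch of what a rank-guided token budget split could look like. It is not the authors' implementation: the abstract does not specify how ranks are computed or how tokens are scored, so the encoder names, the integer rank values, the per-token importance scores, and both function names are hypothetical. The sketch assumes a shared token budget is divided across encoders in proportion to a per-encoder rank score, after which each encoder keeps its highest-scoring tokens.

```python
def assign_budgets(ranks, total_budget):
    """Split total_budget across encoders in proportion to integer rank scores.

    `ranks` maps a (hypothetical) encoder name to a non-negative integer score;
    how such a score would be obtained is an assumption, not taken from the paper.
    """
    total_rank = sum(ranks.values())
    budgets = {name: (total_budget * r) // total_rank for name, r in ranks.items()}
    # Integer division can leave a few tokens unassigned; hand them out
    # one at a time, starting with the highest-ranked encoder.
    leftover = total_budget - sum(budgets.values())
    for name in sorted(ranks, key=ranks.get, reverse=True)[:leftover]:
        budgets[name] += 1
    return budgets


def prune_tokens(scores, keep):
    """Return the indices of the `keep` highest-scoring tokens, in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top)


# Illustrative numbers only: three encoders with 576 tokens each,
# keeping 24% of all tokens (a 76% reduction).
ranks = {"clip": 300, "convnext": 200, "sam": 100}
total_tokens = 3 * 576
total_budget = total_tokens * 24 // 100
budgets = assign_budgets(ranks, total_budget)
kept = prune_tokens([0.1, 0.9, 0.5, 0.7], keep=2)
```

Proportional allocation is only one possible design; it gives encoders with a higher rank score a larger share of the budget, which is one plausible reading of "rank-guided" but should be checked against the full paper.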
