EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

Jiafei Song; Fengwei Zhou; Jin Qu; Wenjin Jason Li; Tong Wu; Gengjian Xue; Zhikang Zhao; Daomin Wei; Yichao Lu; Bailin Na

arXiv:2604.17087 · cs.CV · April 21, 2026

TL;DR

EvoComp is a visual token compression framework for multimodal large language models (MLLMs). It reduces the number of visual tokens while preserving task accuracy, which speeds up inference, particularly on mobile devices.

Contribution

EvoComp introduces a lightweight transformer-based compressor that selects informative visual tokens. The compressor is trained with labels produced by an evolutionary labeling strategy, together with specialized loss functions.
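
To make the idea of "selecting informative visual tokens" concrete, here is a minimal sketch of a keep-the-best selection step. It assumes each visual token already has a scalar score; the `select_tokens` helper, the example scores, and the one-third keep ratio are all hypothetical and chosen only to mirror the 3x compression figure. The source does not describe how the compressor actually scores or selects tokens.

```python
def select_tokens(scores, keep_ratio=1 / 3):
    """Return the indices of the highest-scoring tokens, in original order.

    Hypothetical helper for illustration; not the paper's selection rule.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank indices by score, highest first; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Re-sort the survivors so the kept tokens stay in their original order.
    return sorted(ranked[:n_keep])


# Assumed per-token scores for six visual tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
kept = select_tokens(scores)
print(kept)
```

Returning the indices in their original order is a design choice of this sketch: it keeps the surviving tokens in the same sequence the model would otherwise have seen.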

Findings

01. Retains 99.3% of original accuracy with 3x token compression.

02. Achieves up to 1.6x speedup on mobile devices.

03. Outperforms existing attention-based or similarity-based methods.

Abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping.…
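
The evolutionary labeling idea in the abstract (search for a token subset that minimizes the model's loss while keeping the selection semantically diverse) can be illustrated with a toy evolutionary search. This is only a sketch under stated assumptions: `evolve_subset`, the swap mutation, the truncation selection, the 0.1 diversity penalty weight, and the `toy_loss` stand-in for the MLLM's output loss are all hypothetical, and the `groups` list merely stands in for vocabulary-based token grouping. The paper's actual operators and objective are not given in this excerpt.

```python
import random


def evolve_subset(n_tokens, budget, loss_fn, groups=None,
                  pop_size=20, generations=30, seed=0):
    """Toy evolutionary search for a low-loss token subset of size `budget`."""
    rng = random.Random(seed)

    def fitness(subset):
        penalty = 0.0
        if groups is not None:
            # Assumed diversity term: penalize several tokens from one group.
            covered = {groups[i] for i in subset}
            penalty = 0.1 * (len(subset) - len(covered))
        return loss_fn(subset) + penalty

    def mutate(subset):
        # Swap one kept token for one currently dropped token.
        kept = set(subset)
        dropped = [i for i in range(n_tokens) if i not in kept]
        if not kept or not dropped:
            return subset
        kept.remove(rng.choice(sorted(kept)))
        kept.add(rng.choice(dropped))
        return tuple(sorted(kept))

    population = [tuple(sorted(rng.sample(range(n_tokens), budget)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged, refill with mutated survivors.
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]
        children = [mutate(rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return min(population, key=fitness)


# Stand-in for the MLLM loss: count "important" tokens that were dropped.
IMPORTANT = {2, 5, 7}


def toy_loss(subset):
    return len(IMPORTANT - set(subset))


groups = [i % 4 for i in range(12)]  # assumed semantic group per token
best = evolve_subset(n_tokens=12, budget=4, loss_fn=toy_loss, groups=groups)
print(best, toy_loss(best))
```

In a real setting each fitness evaluation would require a forward pass of the MLLM on the candidate subset, so a search like this would plausibly run offline to produce training labels rather than at inference time.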
