VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

Hanxun Yu; Wentong Li; Xuan Qu; Song Wang; Junbo Chen; Jianke Zhu

arXiv:2601.22674 · cs.CV · March 31, 2026

TL;DR

VisionTrim is a training-free framework that accelerates multimodal large language models (MLLMs) by reducing the number of visual tokens. It uses two plug-and-play modules: one selects the dominant visual tokens, and the other merges tokens under textual guidance.

Contribution

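It introduces a unified, training-free approach to visual token reduction built from two plug-and-play modules, Dominant Vision Token Selection (DVTS) and Text-Guided Vision Complement (TGVC). The aim is to improve MLLM efficiency while avoiding the performance degradation the authors attribute to methods that handle isolated pipeline components and neglect textual alignment.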

Findings

01 Outperforms existing token reduction methods on image and video benchmarks.

02 Effectively preserves critical visual information while reducing computational costs.

03 Enables practical deployment of MLLMs in real-world scenarios.

Abstract

Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world…
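
The abstract names the two modules but not their internals, so the following is only a minimal sketch of the general select-then-merge pattern, not the authors' implementation. Everything specific in it is an assumption made for illustration: the per-token `global_scores` and `local_scores`, the blending weight `alpha`, the pooled text embedding `text_vec`, and the rule of grouping discarded tokens around the most text-relevant ones. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def select_dominant(global_scores, local_scores, k, alpha=0.5):
    """Selection step (assumed form): blend a global and a local importance
    score per token and keep the k highest, returned in original order."""
    blended = [alpha * g + (1 - alpha) * l
               for g, l in zip(global_scores, local_scores)]
    top = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)[:k]
    return sorted(top)

def text_guided_merge(tokens, kept, text_vec, m):
    """Merging step (assumed form): fold the tokens not in `kept` into at
    most m complement tokens, weighting members by similarity to the text."""
    kept_set = set(kept)
    rest = [i for i in range(len(tokens)) if i not in kept_set]
    if not rest or m <= 0:
        return []
    relevance = {i: cosine(tokens[i], text_vec) for i in rest}
    # The m most text-relevant discarded tokens act as group centers.
    centers = sorted(rest, key=lambda i: relevance[i], reverse=True)[:m]
    groups = {c: [c] for c in centers}
    for i in rest:
        if i in groups:
            continue
        nearest = max(centers, key=lambda c: cosine(tokens[i], tokens[c]))
        groups[nearest].append(i)
    merged = []
    for c in centers:
        # Exponential weights so text-aligned members dominate the average.
        weights = [math.exp(relevance[i]) for i in groups[c]]
        total = sum(weights)
        dim = len(tokens[c])
        merged.append([
            sum(w * tokens[i][d] for w, i in zip(weights, groups[c])) / total
            for d in range(dim)
        ])
    return merged

def compress(tokens, global_scores, local_scores, text_vec, k, m):
    """Keep k dominant tokens and append at most m merged complement tokens."""
    kept = select_dominant(global_scores, local_scores, k)
    return [tokens[i] for i in kept] + text_guided_merge(tokens, kept, text_vec, m)
```

Calling `compress` with `k` kept tokens and `m` complement tokens returns at most `k + m` vectors, so the token budget passed on to the language model is fixed by those two parameters regardless of the input resolution.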

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

hanxunyu/VisionTrim (GitHub)

Videos

VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration (SlidesLive)