Rethinking Token Reduction for Large Vision-Language Models

Yi Wang; Haofei Zhang; Qihan Huang; Anda Cao; Gongfan Fang; Wei Wang; Xuan Jin; Jie Song; Mingli Song; Xinchao Wang

arXiv:2603.21701 · cs.CV · March 24, 2026

Open Access

TL;DR

This paper introduces MetaCompress, a learning-based token reduction method for multi-turn vision-language tasks, improving efficiency without sacrificing accuracy across various models and dialogue turns.

Contribution

The paper proposes a novel, data-efficient, learning-based token compression approach that outperforms heuristic methods in multi-turn VQA scenarios.

Findings

01 MetaCompress achieves better efficiency-accuracy trade-offs than heuristic token reduction methods.

02 It generalizes well across different LVLM architectures.

03 It maintains performance across multiple dialogue turns.

Abstract

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this…
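
To make the prompt-agnostic category concrete, the following is a minimal sketch of the kind of heuristic the abstract describes: ranking visual tokens by an attention-derived score and keeping only the top fraction. It is not MetaCompress, whose details are not given in the text above. The function name `prune_visual_tokens`, the `keep_ratio` parameter, and the example scores are all hypothetical, chosen only for illustration.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Heuristic, prompt-agnostic pruning sketch (NOT MetaCompress).

    tokens:     list of visual tokens (any objects, e.g. embeddings)
    scores:     one importance score per token, assumed here to come from
                attention weights (hypothetical input)
    keep_ratio: fraction of tokens to keep, in (0, 1]
    Returns the kept tokens and their original indices, in original order.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Always keep at least one token.
    k = max(1, round(len(tokens) * keep_ratio))

    # Rank indices by descending score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))

    # Restore spatial/sequence order for the tokens that survive.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept], kept


# Illustrative values only: eight tokens with made-up attention scores.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
scores = [0.05, 0.30, 0.02, 0.25, 0.01, 0.20, 0.10, 0.07]
kept_tokens, kept_idx = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
print(kept_tokens, kept_idx)
```

Because the scores are fixed before any question is asked, the same tokens are kept for every dialogue turn. That makes the method applicable to multi-turn settings, but a region with a low score is discarded even if a later question refers to it, which is the limitation the abstract attributes to heuristic metrics.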

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis