Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Tianfan Peng; Yuntao Du; Pengzhou Ji; Shijie Dong; Kailin Jiang; Mingchuan Ma; Yijun Tian; Jinhe Bi; Qian Li; Wei Du; Feng Xiao; Lizhen Cui

arXiv:2511.02650·cs.CV·November 18, 2025

TL;DR

This paper introduces UniPruneBench, a comprehensive benchmark for evaluating visual token pruning methods in large multimodal models, addressing the need for standardized assessment of efficiency and accuracy trade-offs.

Contribution

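The paper provides a unified, extensible benchmark with standardized protocols for evaluating visual token pruning. It spans six ability dimensions, ten datasets, ten compression algorithms, and three LMM families, and reports both task accuracy and system-level metrics such as runtime and prefilling latency.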

Findings

01 Random pruning is a surprisingly strong baseline.

02 No single pruning method outperforms the others across all scenarios.

03 Pruning sensitivity varies across tasks, especially for OCR.

Abstract

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method…
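
To make the random-pruning baseline mentioned above concrete, here is a minimal sketch of what such a baseline could look like. It is not the benchmark's implementation: the function name `random_prune`, the `keep_ratio` parameter, the fixed seed, and the toy token count of 576 are all illustrative assumptions, and real pipelines would operate on tensors inside the model rather than on Python lists.

```python
import random


def random_prune(visual_tokens, keep_ratio, seed=0):
    """Keep a uniformly random subset of visual tokens (hypothetical helper).

    Returns the kept tokens in their original order, plus their indices.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_tokens = len(visual_tokens)
    if n_tokens == 0:
        return [], []
    # Always keep at least one token so the model still sees some visual input.
    n_keep = max(1, round(n_tokens * keep_ratio))
    rng = random.Random(seed)
    # Sample without replacement, then sort to preserve spatial order.
    kept_idx = sorted(rng.sample(range(n_tokens), n_keep))
    return [visual_tokens[i] for i in kept_idx], kept_idx


# Toy input: 576 placeholder "tokens" (an assumed count), each a tiny feature vector.
tokens = [[float(i), float(i) * 0.5] for i in range(576)]
kept, idx = random_prune(tokens, keep_ratio=0.25)
print(len(kept))  # 144 tokens remain at a 25% keep ratio
```

Sorting the sampled indices keeps the surviving tokens in their original sequence order, which is one plausible design choice when positional information matters; whether the benchmarked methods do this is not stated in the abstract.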

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications