How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

YiJie Huang; Yiqun Zhang; Zhuoyue Jia; Xiaocui Yang; Junzhao Huang; Zihan Wang; Shi Feng; Daling Wang; Yifei Zhang; Yongkang Liu

arXiv:2605.16359·cs.CV·May 19, 2026


TL;DR

This paper introduces F^3A, a training-free method for task-conditioned visual token pruning in multimodal models. It allocates visual tokens under a fixed budget and requires no additional training or extra inference passes.

Contribution

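F^3A is a training-free router that treats visual token pruning as task-conditioned evidence search. It selects and allocates visual tokens under a fixed budget before the language model consumes them, reducing the number of image tokens the language backbone has to process.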

Findings

01

F^3A effectively reduces visual tokens without retraining.

02

It maintains model performance while lowering inference costs.

03

F^3A operates without extra model training or inference passes.

Abstract

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a…
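The general idea of question-conditioned pruning under a fixed token budget can be illustrated with a minimal sketch. This is not the F^3A algorithm: the abstract above is truncated before the allocation step, so the code below substitutes plain cosine similarity for the frozen sparse sensing heads and a simple top-k selection for the allocation rule. The function name `prune_visual_tokens` and all parameter names are hypothetical.

```python
import numpy as np


def prune_visual_tokens(visual_tokens, question_cues, budget):
    """Keep the `budget` visual tokens that best match any question cue.

    Illustrative stand-in only: cosine similarity and top-k are assumptions,
    not the scoring or allocation used by F^3A.

    visual_tokens: (N, d) array of visual-grid token embeddings.
    question_cues: (M, d) array of question-derived cue embeddings.
    Returns (kept_tokens, kept_indices), indices in original grid order.
    """
    n_tokens = visual_tokens.shape[0]
    if not 0 < budget <= n_tokens:
        raise ValueError("budget must be between 1 and the number of tokens")

    eps = 1e-8  # guards against division by zero for all-zero rows
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=1, keepdims=True) + eps)
    q = question_cues / (np.linalg.norm(question_cues, axis=1, keepdims=True) + eps)

    # Cosine similarity between every visual token and every cue: (N, M).
    similarity = v @ q.T

    # Score each token by its best-matching cue.
    scores = similarity.max(axis=1)

    # Pick the `budget` highest-scoring tokens, then restore grid order so
    # the language model still sees tokens in their original sequence.
    top = np.argpartition(-scores, budget - 1)[:budget]
    kept_indices = np.sort(top)
    return visual_tokens[kept_indices], kept_indices


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(576, 64))  # e.g. a 24x24 visual grid
    cues = rng.normal(size=(4, 64))      # a few question-derived cues
    kept, idx = prune_visual_tokens(tokens, cues, budget=64)
    print(kept.shape, idx[:5])
```

Because the selection happens before the language model sees any image tokens, a router of this shape adds no extra pass through the language backbone. The scoring function is the part a real method would replace.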
