TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models

Haokui Zhang; Congyang Ou; Dawei Yan; Peng Wang; Qingsen Yan; Yu Zhang; Ying Li; Rong Xiao

arXiv:2602.04657·cs.CV·May 15, 2026

TL;DR

TRIO is a training-free method for visual token reduction in vision-language models. It selects tokens so as to keep the model's output unchanged, which speeds up inference while largely preserving performance.

Contribution

TRIO introduces an inference-objective-based approach to token compression that is compatible with existing acceleration techniques and deployment scenarios.

Findings

01

Retains 97.2% of performance with only 11.1% of the visual tokens on LLaVA-Next-7B.

02

Achieves 2.75× prefill speedup and 2.14× inference speedup.

03

Reduces FLOPs and KV cache overhead by more than 6×.
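
To see how a token retention rate relates to a cost reduction factor, the sketch below does the back-of-the-envelope arithmetic for any cost that grows linearly with sequence length, such as KV cache size. The token counts (2880 visual tokens, 64 text tokens) are illustrative assumptions rather than figures from the paper, the function name is hypothetical, and the result is not a reproduction of the reported measurements.

```python
def linear_cost_reduction(visual_tokens, text_tokens, keep_ratio):
    """Return (kept visual tokens, reduction factor) for a cost that is
    linear in the total sequence length (for example, KV cache size)."""
    # Only visual tokens are pruned; text tokens are left untouched.
    kept_visual = round(visual_tokens * keep_ratio)
    before = visual_tokens + text_tokens
    after = kept_visual + text_tokens
    return kept_visual, before / after


# Assumed, illustrative token counts; not taken from the paper.
kept, factor = linear_cost_reduction(visual_tokens=2880, text_tokens=64, keep_ratio=0.111)
print(kept, round(factor, 2))  # prints: 320 7.67
```

Because the text tokens are not pruned, the overall factor is always smaller than 1 / keep_ratio (about 9 here), and it shrinks further as the prompt gets longer.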

Abstract

Reducing redundant visual tokens in vision-language models (VLMs) to accelerate inference has recently become an active research topic. However, most existing methods rely on heuristics built on inter-visual-token similarity or cross-modal visual-text similarity, which limits both compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives: it recasts visual token compression as preserving the invariance of the output and selects tokens primarily by their importance to this goal. Specifically, visual tokens are reordered under the guidance of token-level gradient saliency generated by our layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable visual tokens are selected following the non-maximum suppression (NMS)…
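
The abstract is cut off before it describes the NMS step, so the following is only a minimal sketch of how saliency-ordered, NMS-style token selection could look, not the authors' implementation. It assumes that per-token saliency scores are already available (in TRIO they come from the gradient of the layer-local proxy loss; here they are plain inputs), that redundancy is measured by cosine similarity between token features, and that the names `select_tokens_nms` and `sim_threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def select_tokens_nms(features, saliency, keep, sim_threshold=0.8):
    """Greedy NMS-style selection: visit tokens in descending saliency order
    and skip any token that is too similar to one already kept.
    May return fewer than `keep` indices if too many tokens are suppressed."""
    order = sorted(range(len(saliency)), key=lambda i: -saliency[i])
    kept = []
    for idx in order:
        if len(kept) == keep:
            break
        # Keep the token only if it is not a near-duplicate of a kept token.
        if all(cosine(features[idx], features[j]) <= sim_threshold for j in kept):
            kept.append(idx)
    # Restore the original token order for the downstream layers.
    return sorted(kept)


# Toy example: token 1 is salient but nearly identical to token 0.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
saliency = [0.9, 0.8, 0.5, 0.3]
selected = select_tokens_nms(features, saliency, keep=2)
print(selected)  # prints: [0, 2]
```

A plain top-2 pick by saliency would return tokens 0 and 1, which carry almost the same information; the suppression step trades the second of these for the less salient but distinct token 2.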

Code & Models

Repositories

ocy1/TRIO
github
