TL;DR
TRIO is a training-free visual token reduction method for vision-language models (VLMs). It selects tokens so that the model's output stays invariant, which speeds up inference and lowers compute and memory cost while retaining most of the original performance.
Contribution
TRIO frames visual token compression in terms of the inference objective rather than similarity heuristics, and it is compatible with existing acceleration techniques and deployment scenarios.
Findings
Retains 97.2% of performance with only 11.1% of the visual tokens on LLaVA-Next-7B.
Achieves 2.75× prefill speedup and 2.14× inference speedup.
Reduces FLOPs and KV cache overhead by more than 6×.
Abstract
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS)…
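The selection step described in the abstract (rank tokens by saliency, then pick them NMS-style) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before the NMS details, so the cosine-similarity suppression rule, the `sim_threshold` parameter, the backfill step, and the function name `select_tokens_nms` are all assumptions made for illustration. The saliency scores are taken as given; the paper's layer-local proxy loss that produces them is not reproduced here.

```python
import numpy as np

def select_tokens_nms(features, saliency, keep, sim_threshold=0.9):
    """Greedy NMS-style selection over visual tokens (illustrative sketch).

    features:      (N, D) array of token embeddings.
    saliency:      (N,) array of per-token importance scores.
    keep:          maximum number of tokens to retain.
    sim_threshold: hypothetical cosine-similarity cutoff; tokens more similar
                   than this to an already kept token are suppressed.
    Returns the kept token indices in ascending (positional) order.
    """
    features = np.asarray(features, dtype=np.float64)
    saliency = np.asarray(saliency, dtype=np.float64)

    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    # Visit tokens from most to least salient.
    order = np.argsort(-saliency, kind="stable")
    suppressed = np.zeros(len(saliency), dtype=bool)
    kept = []
    for idx in order:
        if len(kept) == keep:
            break
        if suppressed[idx]:
            continue
        kept.append(int(idx))
        # Suppress every token that is near-duplicate of the one just kept.
        suppressed |= (unit @ unit[idx]) > sim_threshold

    # Assumed fallback: if suppression left too few tokens, backfill with the
    # most salient suppressed ones so the token budget is still met.
    if len(kept) < keep:
        chosen = set(kept)
        for idx in order:
            if len(kept) == keep:
                break
            if int(idx) not in chosen:
                kept.append(int(idx))
                chosen.add(int(idx))

    return sorted(kept)
```

With a budget of `keep` tokens, a highly salient token suppresses its near-duplicates, so the retained set covers distinct image content rather than several copies of the same region. Returning indices in positional order keeps the surviving tokens in their original sequence order.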
