Learning Compact Vision Tokens for Efficient Large Multimodal Models

Hao Tang; Chengchao Shen

arXiv:2506.07138·cs.CV·June 10, 2025

TL;DR

This paper introduces a novel method combining Spatial Token Fusion and Multi-Block Token Fusion to significantly reduce vision token sequences in large multimodal models, enhancing inference efficiency while maintaining performance.

Contribution

The paper proposes a new approach to shorten vision token sequences using STF and MBTF modules, enabling faster inference in large multimodal models without losing reasoning ability.

Findings

01. Achieves comparable or better performance with only 25% of the original vision tokens.

02. Reduces inference time significantly while preserving multimodal reasoning.

03. Demonstrates effectiveness on 8 vision-language benchmarks.
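
To make the 25% figure concrete, the following is a rough arithmetic sketch, not a result from the paper. It assumes a 24 x 24 grid of patch tokens (576 in total) and a 2 x 2 fusion window; neither value is stated on this page, and the helper name `fused_token_count` is hypothetical. The pairwise ratio counts only vision-to-vision attention pairs, so it is an upper bound on the attention saving and not a prediction of end-to-end speedup.

```python
def fused_token_count(grid_side: int, window: int) -> int:
    """Number of tokens left after fusing each window x window neighborhood into one."""
    if grid_side % window != 0:
        raise ValueError("grid_side must be divisible by window")
    return (grid_side // window) ** 2


# Assumed setup (not taken from this page): 24 x 24 patch grid, 2 x 2 fusion window.
GRID_SIDE = 24
WINDOW = 2

original = GRID_SIDE ** 2                            # tokens before fusion
reduced = fused_token_count(GRID_SIDE, WINDOW)       # tokens after fusion
keep_ratio = reduced / original                      # fraction of tokens kept
pairwise_ratio = (reduced ** 2) / (original ** 2)    # fraction of vision-vision attention pairs

print(original, reduced, keep_ratio, pairwise_ratio)
```

Under these assumptions, keeping one token per 2 x 2 neighborhood leaves a quarter of the tokens and one sixteenth of the vision-to-vision attention pairs. Text tokens and the feed-forward layers scale differently, so measured inference time will not shrink by the same factor.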

Abstract

Large multimodal models (LMMs) face significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten vision token sequences to accelerate inference. Specifically, we propose a Spatial Token Fusion (STF) method that learns compact vision tokens for a shorter vision token sequence, in which spatially adjacent tokens are fused into one. Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby…
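
The idea of fusing spatially adjacent tokens into one can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, so this example assumes the simplest choice: the tokens in each 2 x 2 neighborhood are concatenated along the feature dimension. The function name `fuse_spatial_tokens`, the row-major grid layout, and the window size are all assumptions for illustration; in a learned module, a trainable projection would typically follow the concatenation to map the result back to the model width. MBTF is not sketched here.

```python
from typing import List

Token = List[float]


def fuse_spatial_tokens(tokens: List[Token], grid_h: int, grid_w: int,
                        window: int = 2) -> List[Token]:
    """Fuse each window x window block of a row-major token grid into one token.

    Assumption: fusion is plain channel concatenation; the paper's actual
    STF operator may differ.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match grid size")
    if grid_h % window != 0 or grid_w % window != 0:
        raise ValueError("grid dimensions must be divisible by window")

    fused: List[Token] = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            merged: Token = []
            # Visit the neighborhood in row-major order and concatenate features.
            for dy in range(window):
                for dx in range(window):
                    merged.extend(tokens[(top + dy) * grid_w + (left + dx)])
            fused.append(merged)
    return fused


# Toy input: a 4 x 4 grid of 2-dimensional tokens; token i has features [i, i + 0.5].
tokens = [[float(i), i + 0.5] for i in range(16)]
fused = fuse_spatial_tokens(tokens, grid_h=4, grid_w=4)

print(len(tokens), "->", len(fused), "tokens; feature size", len(fused[0]))
```

The sequence length drops by a factor of four while each fused token carries four times as many features, which is why a projection layer is usually needed before the tokens reach the language model.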

Code & Models

Repositories

visresearch/LLaVA-STF (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications