Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

Bonan Li; Zicheng Zhang; Songhua Liu; Weihao Yu; Xinchao Wang

arXiv:2505.11945 · cs.CV · May 23, 2025

TL;DR

This paper introduces LLaVA-Meteor, a vision token projection method that cuts the number of visual tokens by 75-95% while maintaining or improving accuracy, reducing the computational cost of visual instruction tuning.

Contribution

The paper proposes a Top-Down Compression paradigm with a Flash Global Fusion module and a Visual-Native Selection mechanism to improve efficiency in vision-language models.

Findings

01. Reduces visual tokens by 75-95% with maintained or improved performance.

02. Achieves state-of-the-art results across 12 benchmarks.

03. Enhances vision modeling with local-to-single scanning and selective token assessment.
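
To give a feel for what a 75-95% token reduction means in practice, here is a minimal sketch of generic score-based token selection. It is not the paper's Visual-Native Selection mechanism: the function name `select_tokens`, the `drop_percent` parameter, and the per-token importance scores are all hypothetical, chosen only for illustration.

```python
def select_tokens(scores, drop_percent):
    """Return indices of the tokens to keep, in their original order.

    scores       -- one importance score per visual token (hypothetical input)
    drop_percent -- integer percentage of tokens to discard, e.g. 75 or 95
    """
    if not 0 <= drop_percent < 100:
        raise ValueError("drop_percent must be in [0, 100)")
    n_tokens = len(scores)
    # Integer arithmetic avoids floating-point rounding; always keep at least one token.
    n_keep = max(1, n_tokens - (n_tokens * drop_percent) // 100)
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(n_tokens), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering of the surviving tokens.
    return sorted(ranked[:n_keep])


# Illustrative scores for 8 tokens; dropping 75% keeps the 2 highest-scoring ones.
example_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
kept = select_tokens(example_scores, 75)
print(kept)
```

Under these assumptions, dropping 75% of 8 tokens keeps 2, and dropping 95% of 1000 tokens keeps 50; how the scores are actually computed is the part the paper's method addresses and is not modeled here.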

Abstract

Visual instruction tuning aims to enable large language models to comprehend the visual world, with a pivotal challenge lying in establishing an effective vision-to-language projection. However, existing methods often grapple with the intractable trade-off between accuracy and efficiency. In this paper, we present LLaVA-Meteor, a novel approach designed to break this deadlock, equipped with a novel Top-Down Compression paradigm that strategically compresses visual tokens without compromising core information. Specifically, we construct a trainable Flash Global Fusion module based on efficient selective state space operators, which aligns the feature space while enabling each token to perceive holistic visual context and instruction preference at low cost. Furthermore, a local-to-single scanning manner is employed to effectively capture local dependencies, thereby enhancing the model's…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Neural Network Applications