Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation

Shun Qian; Bingquan Liu; Chengjie Sun; Zhen Xu; Baoxun Wang

arXiv:2410.10319 · cs.CV · October 15, 2024

Open Access

TL;DR

This paper introduces a Spatial-Aware Efficient Projector (SAEP) that significantly reduces visual tokens and enhances spatial understanding in multi-modal language models, leading to improved performance on multimodal benchmarks.

Contribution

The paper proposes a novel SAEP method employing multi-layer feature aggregation and spatial enhancement to improve efficiency and spatial understanding in MLLMs.

Findings

1. Reduces visual tokens by 75%
2. Improves multimodal spatial understanding
3. Achieves top performance on benchmarks

Abstract

The projector plays a crucial role in multi-modal language models (MLLMs). The number of visual tokens it outputs affects the efficiency of the MLLM, while the quality of those tokens influences the MLLM's visual understanding capabilities. Current work on the projector focuses on reducing the number of visual tokens to improve efficiency, often overlooking the inherent spatial discrepancy between serialized 2-dimensional visual token sequences and natural language token sequences. A Spatial-Aware Efficient Projector (SAEP) is proposed to address this issue. Specifically, SAEP applies a modified separable depthwise convolution module to multi-layer visual features to enhance the spatial information of visual tokens. As a result, SAEP not only reduces the number of visual tokens by 75%, but also significantly improves the multimodal…
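The idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not specify how the separable depthwise convolution is "modified", so the 2x2 kernel, the stride of 2, the channel-wise concatenation of layers, and every function name below are assumptions chosen for illustration. The sketch shows only the general pattern: features from several encoder layers are concatenated per token, the token sequence is folded back into its 2D patch grid, a per-channel (depthwise) spatial filter with stride 2 quarters the number of positions, and a 1x1 (pointwise) step mixes channels.

```python
def aggregate_layers(layers):
    """Concatenate each token's features from several encoder layers (assumed scheme)."""
    merged = []
    for i in range(len(layers[0])):
        features = []
        for layer in layers:
            features.extend(layer[i])
        merged.append(features)
    return merged


def tokens_to_grid(tokens, height, width):
    """Fold a row-major token sequence back into its 2D patch grid."""
    assert len(tokens) == height * width
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def depthwise_conv(grid, kernel, stride):
    """Spatial filtering with one k x k filter per channel; kernel[c][i][j]."""
    k = len(kernel[0])
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for r in range(0, h - k + 1, stride):
        row = []
        for q in range(0, w - k + 1, stride):
            row.append([
                sum(kernel[ch][i][j] * grid[r + i][q + j][ch]
                    for i in range(k) for j in range(k))
                for ch in range(c)
            ])
        out.append(row)
    return out


def pointwise_conv(grid, weight):
    """1x1 convolution that mixes channels at each position; weight[out][in]."""
    return [[[sum(w_o[ch] * cell[ch] for ch in range(len(cell))) for w_o in weight]
             for cell in row] for row in grid]


def project(layers, height, width, dw_kernel, pw_weight, stride=2):
    """Hypothetical projector: aggregate layers, filter spatially, flatten again."""
    grid = tokens_to_grid(aggregate_layers(layers), height, width)
    grid = pointwise_conv(depthwise_conv(grid, dw_kernel, stride), pw_weight)
    return [cell for row in grid for cell in row]


# Toy input: a 4x4 patch grid (16 tokens), two layers with 2 channels each.
H = W = 4
layer_a = [[float(i), 1.0] for i in range(H * W)]
layer_b = [[2.0 * i, 0.0] for i in range(H * W)]
dw_kernel = [[[0.25, 0.25], [0.25, 0.25]] for _ in range(4)]  # 2x2 averaging filters
pw_weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]      # keep channels 0 and 2

out = project([layer_a, layer_b], H, W, dw_kernel, pw_weight)
reduction = 1 - len(out) / (H * W)
print(f"{H * W} tokens -> {len(out)} tokens ({reduction:.0%} fewer)")
```

Under these assumptions, a stride of 2 in both grid directions keeps one output position per 2x2 window, which is where a 75% token reduction would come from. Because each output token is computed from a spatial neighbourhood rather than a single patch, it carries information about the 2D layout that a flat token sequence does not expose directly.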

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Convolution · Focus · Depthwise Convolution