Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models

Yuxuan Liang; Xu Li; Xiaolei Chen; Haotian Chen; Yi Zheng; Chenghang Lai; Bin Li; Xiangyang Xue

arXiv:2501.14276·cs.CV·January 27, 2025

Open Access

TL;DR

This paper introduces GSWA, a semantic-guided weight allocation module that dynamically emphasizes more informative sub-images in high-resolution vision-language models, improving understanding without increasing model size.

Contribution

The paper proposes the GSWA module, which weights sub-images adaptively according to their semantic relevance, and integrates it into SleighVL to improve high-resolution image processing in LVLMs.

Findings

01 SleighVL is more accurate than models with a comparable number of parameters.

02 GSWA shifts the model's focus toward semantically rich sub-images.

03 SleighVL remains competitive with larger models.

Abstract

As the demand for high-resolution image processing in Large Vision-Language Models (LVLMs) grows, sub-image partitioning has become a popular approach for mitigating visual information loss associated with fixed-resolution processing. However, existing partitioning methods uniformly process sub-images, resulting in suboptimal image understanding. In this work, we reveal that the sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. Therefore, we propose the Global Semantic-guided Weight Allocator (GSWA) module, which dynamically allocates weights to sub-images based on their relative information density, emulating human visual attention mechanisms. This approach enables the model to focus on more informative regions, overcoming the limitations of uniform treatment. We integrate GSWA…
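The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' GSWA implementation: it assumes that semantic relevance is measured as cosine similarity between a global image feature and each sub-image feature, that a softmax turns those scores into weights, and that weights are rescaled by the number of sub-images so uniform weights leave features unchanged. The function names (`allocate_weights`, `reweight`) and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; lower temperature gives sharper weights."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def allocate_weights(global_feat, sub_feats, temperature=1.0):
    """Assumed scoring rule: similarity of each sub-image to the global image."""
    scores = [cosine(global_feat, f) for f in sub_feats]
    return softmax(scores, temperature)


def reweight(sub_feats, weights):
    """Scale each sub-image feature by its weight times the sub-image count
    (assumption: uniform weights should reproduce the original features)."""
    n = len(sub_feats)
    return [[w * n * x for x in f] for f, w in zip(sub_feats, weights)]


# Toy example: three 2-D sub-image features and one global feature.
global_feat = [1.0, 0.0]
sub_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = allocate_weights(global_feat, sub_feats)
weighted = reweight(sub_feats, weights)
```

In this toy case the first sub-image, which is aligned with the global feature, receives the largest weight and the opposed one the smallest, in contrast to uniform treatment where all three would count equally.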

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Softmax · Attention Is All You Need · Attentive Walk-Aggregating Graph Neural Network · Focus