Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models
Yuxuan Liang, Xu Li, Xiaolei Chen, Haotian Chen, Yi Zheng, Chenghang Lai, Bin Li, Xiangyang Xue

TL;DR
This paper introduces GSWA, a semantic-guided weight allocation module that dynamically emphasizes more informative sub-images in high-resolution vision-language models, improving understanding without increasing model size.
Contribution
The paper proposes the GSWA module, which weights sub-images adaptively according to their semantic relevance, and integrates it into SleighVL to improve high-resolution image processing in LVLMs.
Findings
SleighVL outperforms models with comparable parameter counts in accuracy.
GSWA improves the model's focus on semantically rich sub-images.
SleighVL remains competitive with larger models.
Abstract
As the demand for high-resolution image processing in Large Vision-Language Models (LVLMs) grows, sub-image partitioning has become a popular approach for mitigating visual information loss associated with fixed-resolution processing. However, existing partitioning methods uniformly process sub-images, resulting in suboptimal image understanding. In this work, we reveal that the sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. Therefore, we propose the Global Semantic-guided Weight Allocator (GSWA) module, which dynamically allocates weights to sub-images based on their relative information density, emulating human visual attention mechanisms. This approach enables the model to focus on more informative regions, overcoming the limitations of uniform treatment. We integrate GSWA…
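The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' GSWA implementation: it assumes each sub-image and the whole image have already been reduced to one feature vector each, uses cosine similarity as a stand-in for "semantic relevance", and turns the similarities into weights with a softmax. The function names, the `temperature` parameter, and the rescaling step are all hypothetical choices made for illustration.

```python
import numpy as np

def allocate_subimage_weights(global_feat, sub_feats, temperature=1.0):
    """Hypothetical helper: softmax over cosine similarity to the global feature.

    global_feat: (dim,) feature vector for the whole image (assumed given).
    sub_feats:   (num_sub, dim) one feature vector per sub-image (assumed given).
    Returns a (num_sub,) array of non-negative weights that sum to 1.
    """
    g = global_feat / np.linalg.norm(global_feat)
    s = sub_feats / np.linalg.norm(sub_feats, axis=1, keepdims=True)
    sims = s @ g                      # cosine similarity per sub-image
    logits = sims / temperature
    logits = logits - logits.max()    # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def reweight_subimages(sub_tokens, weights):
    """Scale each sub-image's tokens by its weight.

    sub_tokens: (num_sub, num_tokens, dim) visual tokens per sub-image.
    Weights are multiplied by num_sub so that uniform weights leave the
    tokens unchanged (an illustrative normalization, not from the paper).
    """
    scale = weights * len(weights)
    return sub_tokens * scale[:, None, None]

# Toy usage: three sub-images, the first most aligned with the global feature.
global_feat = np.array([1.0, 0.0])
sub_feats = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
weights = allocate_subimage_weights(global_feat, sub_feats)
tokens = np.ones((3, 4, 2))
weighted_tokens = reweight_subimages(tokens, weights)
```

In this toy setup the sub-image most similar to the global feature receives the largest weight, which matches the abstract's description of emphasizing more informative regions instead of treating all sub-images uniformly.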
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Softmax · Attention Is All You Need · Attentive Walk-Aggregating Graph Neural Network · Focus
