Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling
Sungjune Park, Yeongyun Kim, Se Yeon Kim, and Yong Man Ro

TL;DR
This paper introduces a specialized large vision-language model for remote sensing that uses multi-level semantic alignment and expert modeling to improve scene understanding and task performance.
Contribution
It proposes a novel framework with semantic augmentation and expert modeling tailored for remote sensing, addressing domain differences from natural images.
Findings
Achieves consistent improvements on remote sensing tasks.
Effectively bridges the gap between general LVLMs and RS-specific understanding.
Enhances multi-level semantic understanding in RS imagery.
Abstract
Large Vision and Language Models (LVLMs) have shown strong performance across various vision-language tasks in natural image domains. However, their application to remote sensing (RS) remains underexplored due to significant domain differences in visual appearances, object scales, and semantics. These discrepancies hinder the effective understanding of RS scenes, which contain rich, multi-level semantic information spanning coarse to fine levels, and hence limit the direct adaptation of existing LVLMs to RS imagery. To address this gap, we propose a novel LVLM framework tailored for RS understanding, incorporating two core components: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling. First, to align multi-level visual features, we introduce the retrieval-based Semantic Augmentation Module, which enriches the visual features with relevant semantics across…
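The abstract excerpt describes retrieval-based semantic augmentation only at a high level, so the following is a minimal sketch of the general idea rather than the authors' module. It assumes a small in-memory bank of semantic embeddings, cosine-similarity retrieval, and a residual fusion of the softmax-weighted top-k matches; the names `cosine`, `augment`, `bank`, `k`, and `temperature` are all hypothetical and do not come from the paper. A multi-level variant could, under the same assumptions, call `augment` once per feature level.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def augment(visual, bank, k=2, temperature=0.1):
    """Enrich one visual feature with its top-k retrieved semantic vectors.

    Assumption: `bank` maps a semantic label to an embedding with the same
    dimensionality as `visual`. This fusion rule (residual add of a
    softmax-weighted mean) is illustrative, not the paper's design.
    """
    # Score every bank entry against the visual feature and keep the k best.
    scored = sorted(
        ((cosine(visual, vec), label, vec) for label, vec in bank.items()),
        key=lambda item: item[0],
        reverse=True,
    )[:k]

    # Softmax-style weights over the retrieved similarities.
    weights = [math.exp(score / temperature) for score, _, _ in scored]
    total = sum(weights)

    # Weighted mean of the retrieved semantic vectors, one dimension at a time.
    retrieved = [
        sum(w * vec[i] for w, (_, _, vec) in zip(weights, scored)) / total
        for i in range(len(visual))
    ]

    # Residual fusion: the original feature plus the retrieved semantics.
    augmented = [v + r for v, r in zip(visual, retrieved)]
    labels = [label for _, label, _ in scored]
    return augmented, labels


if __name__ == "__main__":
    # Toy 3-dimensional bank; real embeddings would come from a text or vision encoder.
    bank = {
        "airport": [1.0, 0.0, 0.0],
        "runway": [0.8, 0.6, 0.0],
        "forest": [0.0, 0.0, 1.0],
    }
    augmented, labels = augment([0.9, 0.1, 0.0], bank, k=2)
    print(labels, augmented)
```

The residual form keeps the original visual feature intact when the retrieved semantics are weak, which is one common design choice for this kind of enrichment; the paper may fuse features differently.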
Taxonomy
Topics: Multimodal Machine Learning Applications · Remote-Sensing Image Classification · Domain Adaptation and Few-Shot Learning
