GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning

Xiao Yang; Ronghao Fu; Zhuoran Duan; Zhiwen Lin; Xueyan Liu; Bo Yang

arXiv:2603.09566 · cs.CV · March 11, 2026

TL;DR

GeoAlignCLIP is a multi-granular semantic alignment framework for remote sensing that improves fine-grained vision-language understanding and outperforms existing RS-specific methods on multiple benchmarks.

Contribution

The paper presents a unified framework for fine-grained alignment in remote sensing that combines multi-granular semantic learning with intra-modal consistency, and introduces a new hierarchical dataset, RSFG-100k.

Findings

1. Outperforms existing RS-specific methods across multiple benchmarks.

2. Achieves more precise visual-semantic alignment at region and concept levels.

3. Demonstrates robustness and accuracy in complex, fine-grained tasks.

Abstract

Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model's ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and…
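
To make the idea of combining global alignment, region-level alignment, and intra-modal consistency more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the excerpt above does not give the loss formulas, so the function names (`info_nce`, `intra_modal_consistency`, `multi_granular_loss`), the weights, the temperature, and the specific form of the consistency term are all assumptions chosen for illustration. The contrastive term is the standard symmetric CLIP-style loss.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric CLIP-style contrastive loss; row i of `a` is paired with row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return float(0.5 * (loss_ab + loss_ba))


def intra_modal_consistency(a, b):
    """ASSUMED form: penalize mismatch between the two modalities' self-similarity matrices."""
    a, b = l2_normalize(a), l2_normalize(b)
    return float(np.mean((a @ a.T - b @ b.T) ** 2))


def multi_granular_loss(img_global, txt_global, img_region, txt_region,
                        w_region=1.0, w_consistency=0.5):
    """Hypothetical weighted sum of global, region-level, and consistency terms."""
    global_term = info_nce(img_global, txt_global)      # scene image <-> caption
    region_term = info_nce(img_region, txt_region)      # image region <-> phrase
    consistency_term = intra_modal_consistency(img_global, txt_global)
    return global_term + w_region * region_term + w_consistency * consistency_term


# Toy embeddings standing in for encoder outputs (8 pairs, 16 dimensions).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = multi_granular_loss(emb, emb, emb, emb)
misaligned = multi_granular_loss(emb, emb[::-1], emb, emb[::-1])
print(aligned < misaligned)
```

In this toy setup, correctly paired embeddings give a lower loss than pairs in reversed order, which is the behavior any alignment objective of this kind should show. The weights `w_region` and `w_consistency` are placeholders, not values from the paper.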

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Remote-Sensing Image Classification