Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval
Yifan Li, Shiying Wang, and Jianqiang Huang

TL;DR
This paper introduces MPS-CLIP, a parameter-efficient, keyword-guided framework for remote sensing image-text retrieval that improves semantic alignment and achieves state-of-the-art results on benchmark datasets.
Contribution
MPS-CLIP shifts remote sensing image-text retrieval (RSITR) from global matching to fine-grained, keyword-guided alignment, using a lightweight adapter and multi-perspective embeddings to improve retrieval performance.
Findings
Achieves a mean Recall (mR) of 35.18% on RSICD and 48.40% on RSITMD.
Outperforms full fine-tuning baselines and recent methods.
Demonstrates effective semantic matching with minimal computational overhead.
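For readers unfamiliar with the mean Recall figures above, the following is a minimal sketch of how the metric is conventionally computed in image-text retrieval: Recall@K is averaged over K = 1, 5, 10 in both retrieval directions. This summary does not spell out the paper's exact evaluation protocol, so the choice of K values and the helper names `recall_at_k` and `mean_recall` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@K in percent for one retrieval direction.

    sim: (n_queries, n_candidates) similarity matrix.
    gt:  list of sets; gt[q] holds the relevant candidate indices for query q.
    """
    # Rank candidates for each query from most to least similar.
    order = np.argsort(-sim, axis=1)
    out = {}
    for k in ks:
        # A query counts as a hit if any relevant candidate is in its top k.
        hits = [bool(set(order[q, :k].tolist()) & gt[q]) for q in range(sim.shape[0])]
        out[k] = 100.0 * float(np.mean(hits))
    return out

def mean_recall(sim_i2t, gt_i2t, gt_t2i, ks=(1, 5, 10)):
    """Average of Recall@K over both directions (assumed definition of mR)."""
    i2t = recall_at_k(sim_i2t, gt_i2t, ks)    # image queries, text candidates
    t2i = recall_at_k(sim_i2t.T, gt_t2i, ks)  # text queries, image candidates
    return float(np.mean(list(i2t.values()) + list(t2i.values())))

# Toy example: 2 images, 2 captions, each image paired with the caption of the same index.
toy_sim = np.array([[0.1, 0.9],
                    [0.8, 0.2]])
toy_gt = [{0}, {1}]
toy_mr = mean_recall(toy_sim, toy_gt, toy_gt, ks=(1, 2))
```

In the toy example every query ranks the wrong item first, so Recall@1 is 0 and Recall@2 is 100 in both directions. In practice the similarity matrix would come from the model's image and text embeddings, and each image usually has several ground-truth captions.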
Abstract
Vision-Language Pre-training (VLP) models like CLIP have significantly advanced Remote Sensing Image-Text Retrieval (RSITR). However, existing methods predominantly rely on coarse-grained global alignment, which often overlooks the dense, multi-scale semantics inherent in overhead imagery. Moreover, adapting these heavy models via full fine-tuning incurs prohibitive computational costs and risks catastrophic forgetting. To address these challenges, we propose MPS-CLIP, a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Specifically, we leverage a Large Language Model (LLM) to extract core semantic keywords, guiding the Segment Anything Model (SamGeo) to generate semantically relevant sub-perspectives. To efficiently adapt the frozen backbone, we introduce a Gated Global Attention (G^2A) adapter, which…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Remote-Sensing Image Classification
