Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval
Rui Yang, Shuang Wang, Yingping Han, Yuanheng Li, Dong Zhao, Dou Quan, Yanhe Guo, Licheng Jiao

TL;DR
This paper introduces a novel multi-scale alignment method for remote sensing image-text retrieval that improves the learning of joint representations by aligning image and text features at multiple scales separately.
Contribution
It proposes a multi-scale cross-modal transformer and semantic alignment losses to enhance retrieval performance in remote sensing applications.
Findings
Outperforms existing state-of-the-art methods across multiple datasets
Demonstrates effectiveness with various visual backbones
Validates the importance of multi-scale alignment in RSITR
Abstract
Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Considering the multi-scale representations in image content and text vocabulary can enable the models to learn richer representations and enhance retrieval. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: (1) Multi-scale Cross-Modal Alignment Transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a…
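The core idea in the abstract, cross-attention between single-scale image features and text features with each scale scored on its own rather than after fusion, can be illustrated with a small sketch. This is not the authors' MSCMAT: it omits the global textual context and the alignment losses, there are no learned projections, and every function name and the toy scoring rule (cosine similarity of mean-pooled features) are assumptions made for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_feats, text_feats):
    """Scaled dot-product attention: image features are queries,
    text token features serve as both keys and values."""
    d = len(text_feats[0])
    attended = []
    for q in image_feats:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in text_feats])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, text_feats)) for i in range(d)]
        )
    return attended

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    # Assumes neither vector is all zeros.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def per_scale_scores(image_feats_by_scale, text_feats):
    """Return one image-text score per scale, kept separate instead of
    fusing the scales before alignment (hypothetical scoring rule)."""
    scores = {}
    for scale, feats in image_feats_by_scale.items():
        attended = cross_attention(feats, text_feats)
        scores[scale] = cosine(mean_pool(feats), mean_pool(attended))
    return scores

# Toy inputs: two scales of 2-D image features and two text tokens.
image_feats_by_scale = {
    "coarse": [[1.0, 0.0]],
    "fine": [[0.0, 1.0], [1.0, 1.0]],
}
text_feats = [[1.0, 0.0], [0.0, 1.0]]
print(per_scale_scores(image_feats_by_scale, text_feats))
```

In a trained model each per-scale score would feed its own alignment loss; here the dictionary simply makes the "separate scales" structure visible.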
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Geographic Information Systems Studies
Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
