Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

Zengbao Sun; Ming Zhao; Gaorui Liu; André Kaup

arXiv:2411.14704·cs.CV·November 25, 2024

TL;DR

This paper introduces CMPAGL, a novel cross-modal retrieval method for remote sensing images and text that effectively integrates global and local features, pre-aligns features for better fusion, and improves retrieval accuracy through re-ranking and enhanced loss functions.

Contribution

The paper proposes CMPAGL, a new pre-aligned cross-modal retrieval approach combining multi-scale features and a re-ranking algorithm, outperforming existing methods on multiple datasets.

Findings

01

Achieves up to a 4.65% improvement in R@1 over state-of-the-art methods.

02

Demonstrates effectiveness across four remote sensing datasets.

03

Enhances feature learning with an improved triplet loss.
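
The R@1 figure in the findings above is Recall at rank 1: the fraction of queries whose top-ranked retrieved item is a correct match. The snippet below is a minimal sketch of how R@K is commonly computed from a query-by-gallery similarity matrix. The function name `recall_at_k`, the toy matrix `sim`, and the ground-truth sets `gt` are illustrative assumptions and do not come from the paper.

```python
def recall_at_k(sim, gt, k):
    """Fraction of queries whose top-k ranked items contain a correct match.

    sim: list of rows, sim[i][j] = similarity of query i to gallery item j.
    gt:  list of sets, gt[i] = gallery indices that are correct for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank gallery indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(j in gt[i] for j in ranked[:k]):
            hits += 1
    return hits / len(sim)


# Toy example (hypothetical values): 3 text queries against 3 images.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
gt = [{0}, {1}, {1}]

r1 = recall_at_k(sim, gt, 1)  # two of three queries rank a match first
r2 = recall_at_k(sim, gt, 2)  # every query has a match in its top two
```

In this toy case the second query ranks a wrong image first, so R@1 is 2/3 while R@2 is 1.0. Retrieval papers typically report R@1, R@5, and R@10 in both directions (text-to-image and image-to-text).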

Abstract

Remote sensing cross-modal text-image retrieval (RSCTIR) has gained attention for its utility in information mining. However, challenges remain in effectively integrating global and local information due to variations in remote sensing imagery and ensuring proper feature pre-alignment before modal fusion, which affects retrieval accuracy and efficiency. To address these issues, we propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention and global-local window cross-attention to capture multi-scale features. A pre-alignment mechanism simplifies modal fusion training, improving retrieval performance. Additionally, we introduce a similarity matrix reweighting (SMR) algorithm for reranking, and enhance the triplet loss function with an intra-class distance term to optimize feature learning.…
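
The abstract mentions a triplet loss extended with an intra-class distance term but does not give the formula here. The sketch below shows one plausible form under stated assumptions: a standard hinge-based ranking term on cosine similarity plus a weighted penalty on the distance between the anchor and its matching sample. The names `margin` and `intra_weight`, their default values, and the choice of `1 - cosine` as the distance are all assumptions for illustration, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_with_intra_term(anchor, positive, negative,
                            margin=0.2, intra_weight=0.1):
    """Hinge triplet loss plus an illustrative intra-class distance penalty.

    margin and intra_weight are hypothetical hyperparameters.
    """
    s_pos = cosine(anchor, positive)
    s_neg = cosine(anchor, negative)
    # Ranking term: the positive should beat the negative by at least margin.
    ranking = max(0.0, margin - s_pos + s_neg)
    # Intra-class term (assumed form): pull the matching pair together.
    intra = 1.0 - s_pos
    return ranking + intra_weight * intra


# Well-separated triplet: positive equals anchor, negative is orthogonal.
easy = triplet_with_intra_term([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Badly ordered triplet: the negative is closer than the positive.
hard = triplet_with_intra_term([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

With these inputs the easy triplet incurs zero loss, while the hard one is penalized by both terms. The intuition behind an intra-class term is that a plain triplet loss only enforces relative ordering, so matching pairs can remain far apart as long as negatives are farther still.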

Taxonomy

Methods: Softmax · Attention Is All You Need · Triplet Loss