TL;DR
This paper introduces an asymmetric multimodal feature matching network (AMFMN) for cross-modal remote sensing text-image retrieval. The model handles multi-scale visual features and filters redundant ones to improve retrieval accuracy, and it is evaluated on a newly constructed fine-grained dataset.
Contribution
The paper proposes AMFMN, a model that combines multi-scale visual self-attention with a dynamic triplet loss, and introduces RSITMD, a challenging fine-grained dataset for RS text-image retrieval.
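For readers unfamiliar with triplet losses in text-image retrieval, the following is a minimal sketch of a standard fixed-margin bidirectional triplet loss over a batch of paired embeddings. It is a common baseline formulation, not the paper's dynamic variant, whose details are not given in this summary. The function names, the `margin` value of 0.2, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def bidirectional_triplet_loss(image_emb, text_emb, margin=0.2):
    """Fixed-margin hinge triplet loss summed over both retrieval directions.

    Assumption: row i of image_emb and row i of text_emb form a matched pair;
    every other row in the batch is treated as a negative.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))

    sim = img @ txt.T          # sim[i, j]: similarity of image i and text j
    pos = np.diag(sim)         # similarities of the matched pairs

    # Image-to-text: each negative text should score below the matched text.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Text-to-image: each negative image should score below the matched image.
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])

    # Exclude the diagonal, where the "negative" is the positive itself.
    off_diag = ~np.eye(sim.shape[0], dtype=bool)
    total = cost_i2t[off_diag].sum() + cost_t2i[off_diag].sum()
    return float(total / sim.shape[0])
```

With a fixed margin, every negative in the batch is penalized by the same rule. A dynamic scheme such as the one the paper names would presumably adjust this behavior per sample, but that step is not sketched here.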
Findings
Achieves state-of-the-art results on four RS text-image datasets.
Effectively filters redundant features and handles multi-scale visual information.
Demonstrates robustness on a newly constructed fine-grained dataset.
Abstract
Remote sensing (RS) cross-modal text-image retrieval has attracted extensive attention for its flexible input and efficient querying. However, traditional methods ignore the multi-scale and redundant targets that characterize RS images, which degrades retrieval accuracy. To address multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN). Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features. AMFMN employs the multi-scale visual self-attention (MVSA) module to extract the salient features of RS images and uses visual features to guide the text representation. Furthermore, to alleviate the ambiguity among positive samples caused by the strong intraclass similarity in RS…
Taxonomy
Methods: Triplet Loss
