Semantics-Consistent Representation Learning for Remote Sensing Image-Voice Retrieval
Hailong Ning, Bin Zhao, and Yuan Yuan

TL;DR
This paper introduces a semantics-consistent representation learning method for remote sensing image-voice retrieval, effectively integrating intra- and inter-modality relationships to enhance cross-modal semantic matching.
Contribution
The proposed semantics-consistent representation learning (SCRL) method jointly models pairwise, intra-modality, and non-paired inter-modality relationships, improving semantic consistency in remote sensing (RS) image-voice retrieval.
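To make the three relationship types concrete, the sketch below combines a pairwise term, an intra-modality term, and a non-paired inter-modality term into one objective. It is an illustrative assumption and not the loss defined in the paper: the contrastive form, the `margin` value, the use of class labels to decide which samples are semantically similar, and the names `consistency_loss` and `_contrastive` are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def sq_dists(a, b):
    # Squared Euclidean distance between every row of a and every row of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=2)

def _contrastive(d, pos, neg, margin):
    # Pull positives together; push negatives at least `margin` apart.
    pull = d[pos].mean() if pos.any() else 0.0
    push = np.maximum(0.0, margin - d[neg]).mean() if neg.any() else 0.0
    return float(pull + push)

def consistency_loss(img, voc, labels, margin=1.0):
    """Hypothetical three-term objective (not the paper's formulation).

    img, voc: (n, d) embeddings; row i of img is paired with row i of voc.
    labels:   (n,) class ids, assumed here to define semantic similarity.
    """
    img, voc = l2_normalize(img), l2_normalize(voc)
    same = labels[:, None] == labels[None, :]
    paired = np.eye(len(labels), dtype=bool)
    non_paired_pos = same & ~paired
    neg = ~same

    cross = sq_dists(img, voc)
    terms = {
        # Pairwise: each image should sit close to its own voice.
        "pairwise": float(cross[paired].mean()),
        # Intra-modality: same-class samples within one modality agree.
        "intra": _contrastive(sq_dists(img, img), non_paired_pos, neg, margin)
                 + _contrastive(sq_dists(voc, voc), non_paired_pos, neg, margin),
        # Non-paired inter-modality: same-class image/voice that are not a pair.
        "inter": _contrastive(cross, non_paired_pos, neg, margin),
    }
    terms["total"] = sum(terms.values())
    return terms

if __name__ == "__main__":
    labels = np.array([0, 0, 1, 1])
    img = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    print(consistency_loss(img, img.copy(), labels))
```

In this toy setup, embeddings that agree across both modalities and are well separated by class drive every term toward zero, while mismatched pairs raise the pairwise and inter-modality terms.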
Findings
Outperforms existing methods on three RS datasets
Effectively narrows the semantic gap between images and voices
Enhances retrieval accuracy through comprehensive relationship modeling
Abstract
Advances in earth observation technology have produced massive amounts of remote sensing (RS) images. Cross-modal RS image-voice retrieval offers a new way to find useful information in these images. This paper studies the RS image-voice retrieval task, with the aim of retrieving useful information from massive amounts of RS data. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship to narrow the heterogeneous semantic gap between images and voices. However, beyond the pairwise relationship provided by the datasets, the intra-modality and non-paired inter-modality relationships should also be taken into account simultaneously, since semantic consistency among non-paired representations plays an important role in the RS image-voice retrieval task. Inspired by this, a semantics-consistent representation learning (SCRL)…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
