Self-supervised Audiovisual Representation Learning for Remote Sensing Data
Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu

TL;DR
This paper introduces a self-supervised method for pre-training neural networks on remote sensing data by leveraging co-located audio and imagery, resulting in improved transfer learning performance without manual annotations.
Contribution
It presents a novel self-supervised approach using audiovisual correspondence for pre-training remote sensing models, along with the new SoundingEarth dataset.
Findings
Models pre-trained with this approach outperform existing pre-training strategies on downstream remote sensing tasks.
The approach enables label-free pre-training using audiovisual data.
Models learn meaningful scene representations across modalities.
Abstract
Many current deep learning approaches make extensive use of backbone networks pre-trained on large datasets like ImageNet, which are then fine-tuned to perform a certain task. In remote sensing, the lack of comparable large annotated datasets and the wide diversity of sensing platforms impedes similar developments. In order to contribute towards the availability of pre-trained backbone networks in remote sensing, we devise a self-supervised approach for pre-training deep neural networks. By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner, eliminating the need for laborious manual annotation. For this purpose, we introduce the SoundingEarth dataset, which consists of co-located aerial imagery and audio samples all around the world. Using this dataset, we then pre-train ResNet models to map…
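The correspondence idea in the abstract can be illustrated with a small sketch: embeddings of an image and of the audio clip recorded at the same location should be more similar than embeddings of mismatched pairs. The example below is only an illustration under stated assumptions. It uses a softmax-based contrastive objective, which is swapped in here for clarity and is not confirmed by the text above as the paper's loss. The function names `l2_normalize` and `correspondence_loss`, the `temperature` parameter, and the hand-written toy vectors are all hypothetical; in practice the embeddings would come from an image encoder (such as a ResNet) and an audio encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def correspondence_loss(image_embs, audio_embs, temperature=0.1):
    """Mean cross-entropy of selecting the co-located audio clip for each image.

    Assumption: image_embs[i] and audio_embs[i] come from the same location,
    and every other audio clip in the batch acts as a negative example.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    auds = [l2_normalize(v) for v in audio_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity between this image and every audio clip in the batch.
        logits = [sum(a * b for a, b in zip(img, aud)) / temperature for aud in auds]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(imgs)


# Toy batch of two locations with 2-dimensional embeddings (made-up values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_audio = [[0.9, 0.1], [0.1, 0.9]]   # audio aligned with its image
swapped_audio = [[0.1, 0.9], [0.9, 0.1]]   # audio paired with the wrong image

matched = correspondence_loss(image_embs, matched_audio)
swapped = correspondence_loss(image_embs, swapped_audio)
print(f"matched pairs: {matched:.4f}, swapped pairs: {swapped:.4f}")
```

The loss is lower when each image is paired with its own audio clip than when the pairs are swapped, which is the signal a label-free pre-training objective of this kind would use to shape the image encoder.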
Taxonomy
Topics: Remote-Sensing Image Classification · Speech and Audio Processing · Underwater Acoustics Research
Methods: Convolution · Batch Normalization · Residual Connection · Average Pooling · Kaiming Initialization · 1x1 Convolution · Global Average Pooling · Residual Block · Bottleneck Residual Block
