Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Subash Khanal, Srikumar Sastry, Aayush Dhakal, Nathan Jacobs

TL;DR
This paper introduces a tri-modal embedding approach for zero-shot soundscape mapping. It combines geotagged audio, textual descriptions of that audio, and overhead images of the capture location to predict the sounds most likely to be heard at a given location, and it outperforms the previous state of the art on the SoundingEarth dataset.
Contribution
The paper presents a contrastive pre-training method that creates a shared embedding space for three modalities (audio, text, and overhead imagery), enabling zero-shot soundscape mapping from textual or audio queries.
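To make the idea of a shared embedding space concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) loss applied to each pair of modalities. It is an illustration under assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the choice to sum the three pairwise losses with equal weight are all hypothetical, and the embeddings are assumed to come from separate audio, text, and image encoders that are not shown.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def pair_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches; row i of a matches row i of b."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # a -> b direction
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def trimodal_loss(audio, text, image, temperature=0.07):
    # Assumption: equal-weight sum over the three modality pairs.
    return (pair_loss(image, audio, temperature)
            + pair_loss(image, text, temperature)
            + pair_loss(audio, text, temperature))
```

Minimising a loss of this kind pulls the audio, text, and image embeddings of the same sample together and pushes embeddings of different samples apart, which is what allows a query in one modality to retrieve items in another.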
Findings
Image-to-audio Recall@100 improves from 0.256 to 0.450.
Outperforms existing state-of-the-art methods on the SoundingEarth dataset.
Enables construction of soundscape maps for any geographic region.
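For readers unfamiliar with the metric, the following is a small sketch of how a cross-modal Recall@K such as image-to-audio Recall@100 is commonly computed. It assumes that row i of the query embeddings and row i of the gallery embeddings belong to the same sample; the function name and the optimistic handling of ties are choices made for this illustration and are not taken from the paper's evaluation code.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose matched gallery item (same row index) ranks in the top k."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                         # cosine similarity, shape (n_query, n_gallery)
    true_sims = np.diag(sims)[:, None]     # similarity of each query to its own match
    # Rank = number of gallery items scoring strictly higher than the true match
    # (ties are therefore resolved in favour of the true match).
    ranks = (sims > true_sims).sum(axis=1)
    return float((ranks < k).mean())
```

With image embeddings as queries and audio embeddings as the gallery, `recall_at_k(image_emb, audio_emb, 100)` would give the image-to-audio Recall@100 for that set of embeddings.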
Abstract
We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location. We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location using contrastive pre-training. The end result is a shared embedding space for the three modalities, which enables the construction of soundscape maps for any geographic region from textual or audio queries. Using the SoundingEarth dataset, we find that our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450. Our code is available at https://github.com/mvrl/geoclap.
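One way to picture how a shared embedding space yields a soundscape map is sketched below: embed a text or audio query, embed the overhead image tile at every cell of a geographic grid, and score each cell by cosine similarity. This is a toy illustration under assumptions. The function name and grid layout are hypothetical, and random vectors stand in for the outputs of the trained encoders.

```python
import numpy as np

def soundscape_map(query_emb, tile_embs, grid_shape):
    """Cosine similarity between one query and every image tile, reshaped to the grid."""
    q = query_emb / np.linalg.norm(query_emb)
    t = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    return (t @ q).reshape(grid_shape)

# Toy example: random stand-ins for image-encoder outputs on a 2 x 3 grid.
rng = np.random.default_rng(0)
tiles = rng.normal(size=(6, 32))
query = tiles[4]  # pretend the query embedding coincides with tile 4
heat = soundscape_map(query, tiles, (2, 3))
```

Higher values in `heat` mark locations whose overhead imagery is closer to the query in the shared space. No training on the target region is needed, which is the sense in which the mapping is zero-shot.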
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Noise Effects and Management
Methods: Focus
