Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping

Subash Khanal; Srikumar Sastry; Aayush Dhakal; Nathan Jacobs

arXiv:2309.10667·cs.CV·September 20, 2023·2 cites


TL;DR

This paper introduces a tri-modal embedding approach for zero-shot soundscape mapping. It aligns geotagged audio, textual descriptions of that audio, and overhead images of the capture location in a shared embedding space to predict the sounds likely to be heard at a location, and it outperforms the previous state of the art on the SoundingEarth dataset.

Contribution

The paper presents a new contrastive pre-training method that creates a shared embedding space for three modalities, enabling accurate zero-shot soundscape predictions from textual or audio queries.
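As an illustration of how contrastive pre-training can pull three modalities into one space, here is a minimal NumPy sketch. It assumes a symmetric InfoNCE loss summed over the three modality pairs (image-audio, image-text, audio-text); the paper's exact objective, encoders, and temperature may differ, and the function names `symmetric_info_nce` and `trimodal_loss` and the default temperature of 0.07 are hypothetical choices for this example, not taken from the mvrl/geoclap code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches whose rows are matched pairs.

    Row i of `a` and row i of `b` are treated as the positive pair; every
    other row in the batch acts as a negative. The temperature value is an
    assumption for this sketch.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def trimodal_loss(image, audio, text, temperature=0.07):
    """Sum of pairwise losses over the three modalities (an assumed combination)."""
    return (
        symmetric_info_nce(image, audio, temperature)
        + symmetric_info_nce(image, text, temperature)
        + symmetric_info_nce(audio, text, temperature)
    )
```

Minimising a loss of this shape rewards batches in which each overhead image sits closer to its own audio clip and description than to anyone else's, which is what makes cross-modal queries possible afterwards.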

Findings

01

Image-to-audio Recall@100 on the SoundingEarth dataset improves from 0.256 (previous SOTA) to 0.450.

02

Outperforms existing state-of-the-art methods on the SoundingEarth dataset.

03

Enables construction of soundscape maps for any geographic region.
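For readers unfamiliar with the Recall@K metric quoted in the findings, the sketch below shows one common way to compute it for cross-modal retrieval: the fraction of queries whose true match appears among the K most similar gallery items. It assumes row i of the query matrix matches row i of the gallery matrix and uses cosine similarity; the paper's exact evaluation protocol may differ, and `recall_at_k` is a hypothetical helper name.

```python
import numpy as np


def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose matching gallery item ranks in the top k.

    Assumes query i's ground-truth match is gallery item i. Ties are
    resolved in the query's favour, because only strictly higher scores
    count against the true match.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                          # (num_queries, num_gallery)
    true_sims = np.diag(sims)[:, None]      # similarity to the correct item
    ranks = (sims > true_sims).sum(axis=1)  # items scoring strictly higher
    return float((ranks < k).mean())
```

With image embeddings as queries and audio embeddings as the gallery, `recall_at_k(image_emb, audio_emb, 100)` would give an image-to-audio Recall@100 under these assumptions.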

Abstract

We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location. We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location using contrastive pre-training. The end result is a shared embedding space for the three modalities, which enables the construction of soundscape maps for any geographic region from textual or audio queries. Using the SoundingEarth dataset, we find that our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450. Our code is available at https://github.com/mvrl/geoclap.
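Once the three modalities share an embedding space, a soundscape map for a region can be sketched as a similarity heat map: embed a text or audio query, embed the overhead image tiles covering the region, and score each tile by cosine similarity. The code below is a minimal illustration under that assumption. The embeddings are taken as given (the encoders are not shown), and `soundscape_map`, `tile_embs`, and `grid_shape` are hypothetical names rather than the API of the released code.

```python
import numpy as np


def soundscape_map(query_emb, tile_embs, grid_shape):
    """Score every overhead image tile against one text or audio query.

    query_emb:  (d,) embedding of the query in the shared space.
    tile_embs:  (rows * cols, d) embeddings of image tiles in row-major order.
    grid_shape: (rows, cols) layout of the tiles over the region.
    Returns a (rows, cols) array of cosine similarities.
    """
    q = query_emb / np.linalg.norm(query_emb)
    t = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    scores = t @ q
    return scores.reshape(grid_shape)
```

Because no tile needs a sound label, the same routine works for any query that the text or audio encoder can embed, which is the sense in which the mapping is zero-shot.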

Code & Models

Repositories

mvrl/geoclap (PyTorch, official): https://github.com/mvrl/geoclap

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Noise Effects and Management
