Spatio-temporal Latent Representations for the Analysis of Acoustic Scenes in-the-wild

Claudia Montero-Ramírez; Esther Rituerto-González; Carmen Peláez-Moreno

arXiv:2412.07648 · eess.AS · December 11, 2024

TL;DR

This paper introduces a novel self-supervised learning approach to extract spatio-temporal acoustic scene representations from in-the-wild audio data, enabling differentiation of environments like indoor and subway settings.

Contribution

It presents a new method combining acoustic embeddings, NLP-inspired algorithms, and VAEs to characterize and visualize acoustic scenes in real-world data.

Findings

01

Distinct acoustic scenes can be identified in the latent space.

02

Indoor and subway environments show clear separation in the embeddings.

03

The approach effectively captures spatio-temporal variations in in-the-wild audio.

Abstract

In the field of acoustic scene analysis, this paper presents a novel approach to finding spatio-temporal latent representations from in-the-wild audio data. Using WE-LIVE, an in-house dataset of audio recordings from diverse real-world environments accompanied by sparse GPS coordinates and self-annotated emotional and situational labels, we tackle the challenging task of associating each audio segment with its corresponding location as a pretext task; the final aim, acoustically detecting violent (anomalous) contexts, is left as further work. By generating acoustic embeddings and using the self-supervised learning paradigm, we aim to use the model-generated latent space to acoustically characterize the spatio-temporal context. We use YAMNet, an acoustic event classifier trained on AudioSet, to temporally locate and identify acoustic events in WE-LIVE. In order to…
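As a rough illustration of the pretext task mentioned in the abstract (associating audio segments with locations when GPS coordinates are sparse), the sketch below labels each segment with the GPS fix closest in time and leaves it unlabeled when no fix is near enough. This is not the authors' method: the function name `nearest_fix_labels`, the `max_gap_s` tolerance, and the data layout are all hypothetical choices made for the example.

```python
from bisect import bisect_left


def nearest_fix_labels(segment_times, gps_fixes, max_gap_s=60.0):
    """Assign each audio segment the location of the GPS fix closest in time.

    Hypothetical helper for illustration only.
    segment_times: segment midpoints, in seconds.
    gps_fixes: list of (timestamp_s, (lat, lon)), sorted by timestamp.
    Returns one (lat, lon) per segment, or None when no fix lies
    within max_gap_s seconds of the segment.
    """
    fix_times = [t for t, _ in gps_fixes]
    labels = []
    for t in segment_times:
        i = bisect_left(fix_times, t)
        # Candidates: the fix just before and the fix just after the segment.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(gps_fixes)]
        best = min(candidates, key=lambda j: abs(fix_times[j] - t), default=None)
        if best is None or abs(fix_times[best] - t) > max_gap_s:
            labels.append(None)  # too far from any fix: leave unlabeled
        else:
            labels.append(gps_fixes[best][1])
    return labels


# Two sparse fixes five minutes apart, three audio segments (made-up values).
fixes = [(0.0, (40.0, -3.0)), (300.0, (40.1, -3.1))]
labels = nearest_fix_labels([10.0, 150.0, 290.0], fixes)
```

Segments left as `None` are the ones a self-supervised model would have to place using acoustic evidence alone, which is one way to read the motivation for learning a latent space rather than relying on the GPS labels directly.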

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies

Methods: node2vec · Greedy Policy Search