Spatio-temporal Latent Representations for the Analysis of Acoustic Scenes in-the-wild
Claudia Montero-Ramírez, Esther Rituerto-González, Carmen Peláez-Moreno

TL;DR
This paper introduces a novel self-supervised learning approach to extract spatio-temporal acoustic scene representations from in-the-wild audio data, enabling differentiation of environments like indoor and subway settings.
Contribution
It presents a new method combining acoustic embeddings, NLP-inspired algorithms, and VAEs to characterize and visualize acoustic scenes in real-world data.
Findings
Distinct acoustic scenes can be identified in the latent space.
Indoor and subway environments show clear separation in the embeddings.
The approach effectively captures spatio-temporal variations in in-the-wild audio.
Abstract
In the field of acoustic scene analysis, this paper presents a novel approach to find spatio-temporal latent representations from in-the-wild audio data. By using WE-LIVE, an in-house collected dataset that includes audio recordings in diverse real-world environments together with sparse GPS coordinates and self-annotated emotional and situational labels, we tackle the challenging task of associating each audio segment with its corresponding location as a pretext task, with the final aim of acoustically detecting violent (anomalous) contexts, left as further work. By generating acoustic embeddings and using the self-supervised learning paradigm, we aim to use the model-generated latent space to acoustically characterize the spatio-temporal context. We use YAMNet, an acoustic events classifier trained on AudioSet, to temporally locate and identify acoustic events in WE-LIVE. In order to…
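The step of temporally locating and identifying acoustic events can be illustrated with a minimal sketch. It assumes a frame-level classifier such as YAMNet has already produced one row of class scores per frame; the function name `frames_to_events`, the 0.48 s hop, and the 0.3 score threshold are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed hop between consecutive score frames, in seconds.
# This is an illustrative default, not a value stated in the paper.
FRAME_HOP_S = 0.48

Event = Tuple[str, float, float]  # (label, start time, end time)


def frames_to_events(
    scores: Sequence[Sequence[float]],
    class_names: Sequence[str],
    hop_s: float = FRAME_HOP_S,
    min_score: float = 0.3,  # hypothetical confidence threshold
) -> List[Event]:
    """Merge consecutive frames that share the same top class into events."""
    events: List[Event] = []
    current: Optional[Tuple[str, int]] = None  # (label, start frame index)

    for i, row in enumerate(scores):
        # Pick the highest-scoring class for this frame.
        best = max(range(len(row)), key=row.__getitem__)
        # Frames whose best score is below the threshold carry no label.
        label = class_names[best] if row[best] >= min_score else None

        # Close the running event when the label changes.
        if current is not None and current[0] != label:
            events.append((current[0], current[1] * hop_s, i * hop_s))
            current = None

        # Open a new event when a labelled frame starts a run.
        if label is not None and current is None:
            current = (label, i)

    # Close an event that runs to the end of the recording.
    if current is not None:
        events.append((current[0], current[1] * hop_s, len(scores) * hop_s))
    return events


if __name__ == "__main__":
    toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.2], [0.1, 0.9]]
    for event in frames_to_events(toy_scores, ["Speech", "Train"]):
        print(event)
```

The resulting list of labelled, time-stamped events is the kind of sequence that a downstream sequence model could consume; how the paper actually builds its sequences is not specified in the truncated abstract above.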
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
Methods: node2vec · Greedy Policy Search
