wav2pos: Sound Source Localization using Masked Autoencoders

Axel Berg; Jens Gulin; Mark O'Connor; Chuteng Zhou; Karl Åström; Magnus Oskarsson

arXiv:2408.15771·eess.AS·December 17, 2024

TL;DR

This paper introduces wav2pos, a novel masked autoencoder-based method for 3D sound source localization that effectively handles variable microphone setups and missing data, demonstrating competitive results in indoor environments.

Contribution

Proposes a flexible, set-to-set regression approach using masked autoencoders for accurate 3D sound localization with arbitrary microphone configurations.

Findings

01

Achieves accurate localization in simulated and real-world recordings.

02

Handles missing microphone data effectively.

03

Performs competitively against classical and learning-based methods.

Abstract

We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of the audio recordings and microphone coordinates is missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning-based localization methods.
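The set-to-set formulation in the abstract can be illustrated with a minimal sketch of how a masked input set might be built. This is not the authors' implementation: the `Token` class, the `mask_tokens` helper, the `drop_prob` parameter, and the rule that at most one modality is hidden per microphone are all hypothetical choices made for illustration. The sketch only shows the data layout (a set of devices, each with optional audio and optional coordinates, where the source position is always hidden); the autoencoder that would reconstruct the hidden entries is omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Token:
    """Hypothetical set element: one device with optional audio and 3D position."""
    audio: Optional[List[float]]
    position: Optional[Vec3]


def mask_tokens(
    tokens: List[Token],
    source_index: int,
    drop_prob: float,
    rng: random.Random,
) -> Tuple[List[Token], List[Token]]:
    """Return (masked_inputs, targets) for one example.

    The source position is always hidden, since it is the quantity to localize.
    Each microphone loses its audio or its coordinates with probability
    drop_prob each (an assumption of this sketch), never both at once.
    """
    masked = []
    for i, tok in enumerate(tokens):
        audio, position = tok.audio, tok.position
        if i == source_index:
            position = None
        else:
            r = rng.random()
            if r < drop_prob:
                audio = None
            elif r < 2 * drop_prob:
                position = None
        masked.append(Token(audio, position))
    # The regression target is the complete, unmasked set.
    return masked, list(tokens)


rng = random.Random(0)
# Four microphones with short dummy audio clips and random coordinates.
mics = [
    Token(
        audio=[rng.uniform(-1.0, 1.0) for _ in range(8)],
        position=(rng.random(), rng.random(), rng.random()),
    )
    for _ in range(4)
]
# In this sketch the source carries no audio of its own, only a position.
source = Token(audio=None, position=(0.5, 0.5, 0.5))
tokens = mics + [source]

masked, targets = mask_tokens(
    tokens, source_index=len(tokens) - 1, drop_prob=0.25, rng=rng
)
```

Because the input is a plain list of tokens, the same helper works for any number of microphones, which mirrors the flexibility the abstract describes.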

Code & Models

Repositories

axeber01/wav2pos
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis