Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression
Antoine Deleforge, Radu Horaud, Yoav Schechner, Laurent Girin

TL;DR
This paper introduces a supervised binaural method that localizes multiple audio sources simultaneously without source separation. Trained on fixed-length white noise, it extends to variable-length sounds such as speech at test time and supports audio-visual fusion. It is validated on a new real-room dataset.
Contribution
It presents a locally-linear Gaussian regression model that maps binaural auditory features to source directions, handles variable-length sounds at test time, and enables audio-visual mapping, with reported gains in accuracy and speed.
Findings
Enhanced localization accuracy over state-of-the-art methods
Effective for speech and white noise sources
Supports real-time audio-visual applications
Abstract
This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation, nor on monaural segregation. The method starts with a training stage that establishes a locally-linear Gaussian regression model between the directional coordinates of all the sources and the auditory features extracted from binaural measurements. While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we show that the testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrate that the method can be used…
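To make the idea of a locally-linear regression between features and directions concrete, here is a minimal sketch in Python. It is a simplified stand-in, not the paper's model: it partitions the feature space with k-means, fits one affine map per region by least squares, and blends the maps with Gaussian weights. The function names (`fit_locally_linear`, `predict_direction`), the number of regions, and the toy two-dimensional "features" are all assumptions made for illustration; real binaural features and the paper's training procedure are not reproduced here.

```python
import numpy as np

def fit_locally_linear(X, Y, n_regions=8, n_iter=20, seed=0):
    """Fit a piecewise-affine map from features X (N, D) to directions Y (N, L).

    Simplified stand-in: k-means regions plus one least-squares affine map per region.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    centers = X[rng.choice(N, n_regions, replace=False)].copy()
    for _ in range(n_iter):  # plain k-means on the features
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for k in range(n_regions):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(0)
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = d2.argmin(1)

    Xa = np.hstack([X, np.ones((N, 1))])           # append 1 for the affine offset
    maps = np.zeros((n_regions, D + 1, Y.shape[1]))
    var = np.ones(n_regions)                        # isotropic variance per region
    weights = np.zeros(n_regions)                   # fraction of points per region
    for k in range(n_regions):
        m = labels == k
        if m.any():
            maps[k] = np.linalg.lstsq(Xa[m], Y[m], rcond=None)[0]
            var[k] = max(d2[m, k].mean() / D, 1e-6)
            weights[k] = m.mean()
    return centers, var, weights, maps

def predict_direction(model, X):
    """Blend the per-region affine predictions with Gaussian responsibilities."""
    centers, var, weights, maps = model
    D = X.shape[1]
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    logw = np.log(weights + 1e-300) - 0.5 * d2 / var - 0.5 * D * np.log(var)
    logw -= logw.max(1, keepdims=True)              # stabilize before exponentiating
    w = np.exp(logw)
    w /= w.sum(1, keepdims=True)
    Xa = np.hstack([X, np.ones((len(X), 1))])
    per_region = np.einsum("nd,kdl->nkl", Xa, maps)
    return (w[:, :, None] * per_region).sum(1)

# Toy data (hypothetical): a scalar direction y produces a nonlinear 2-D feature.
def toy_features(y):
    return np.hstack([np.sin(2 * y), np.cos(2 * y)])

rng = np.random.default_rng(1)
y_train = rng.uniform(-1, 1, size=(500, 1))
y_test = rng.uniform(-1, 1, size=(200, 1))
X_train, X_test = toy_features(y_train), toy_features(y_test)

model = fit_locally_linear(X_train, y_train)
rmse_local = np.sqrt(np.mean((predict_direction(model, X_test) - y_test) ** 2))

# Baseline for comparison: a single global affine map.
Xa_train = np.hstack([X_train, np.ones((len(X_train), 1))])
Xa_test = np.hstack([X_test, np.ones((len(X_test), 1))])
W = np.linalg.lstsq(Xa_train, y_train, rcond=None)[0]
rmse_global = np.sqrt(np.mean((Xa_test @ W - y_test) ** 2))
print(f"local: {rmse_local:.4f}  global: {rmse_global:.4f}")
```

On this toy problem the piecewise model should fit the curved feature-to-direction relationship more closely than the single global map, which is the motivation for using local linearity. Fixed regions and least-squares fits were chosen only to keep the sketch short.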
