Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates
Juan Manuel Vera-Diaz, Daniel Pizarro, Javier Macias-Guarasa

TL;DR
This paper introduces a CNN-based method for indoor acoustic source localization that estimates 3D source positions directly from raw microphone-array audio. It outperforms SRP-PHAT-based methods and is more robust to variations in speaker gender and window size.
Contribution
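The paper presents what the authors describe as, to the best of their knowledge, the first published CNN architecture that directly estimates 3D source positions from raw audio. It is trained with a two-step strategy: initial training on semi-synthetic data, followed by fine-tuning on a small amount of real data.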
Findings
Significantly outperforms SRP-PHAT localization methods.
Exhibits better robustness to speaker gender variations.
Shows better robustness to different window sizes.
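For context on the SRP-PHAT baseline named above: SRP-PHAT methods build on PHAT-weighted cross-correlations between microphone pairs. The sketch below shows that building block for a single pair, as a time-delay estimate. It is a generic illustration and not the baseline configuration used in the paper; the function name `gcc_phat_delay` and its parameters are hypothetical.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (seconds) with GCC-PHAT.

    Hypothetical helper for illustration. A positive result means `sig`
    arrives later than `ref`.
    """
    sig = np.asarray(sig, dtype=float)
    ref = np.asarray(ref, dtype=float)

    # Zero-pad so the circular correlation behaves like a linear one
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)

    # PHAT weighting: keep only the phase of the cross-spectrum
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index i corresponds to lag (i - max_shift)
    cc = np.roll(cc, max_shift)[: 2 * max_shift + 1]
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs
```

A full SRP-PHAT localizer would accumulate such correlations over all microphone pairs and search a grid of candidate positions; that search is omitted here.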
Abstract
This paper presents a novel approach for indoor acoustic source localization using microphone arrays and based on a Convolutional Neural Network (CNN). The proposed solution is, to the best of our knowledge, the first published work in which the CNN is designed to directly estimate the three-dimensional position of an acoustic source, using the raw audio signal as the input information and avoiding the use of hand-crafted audio features. Given the limited amount of available localization data, we propose in this paper a training strategy based on two steps. We first train our network using semi-synthetic data, generated from close-talk speech recordings, in which we simulate the time delays and distortion suffered by the signal as it propagates from the source to the array of microphones. We then fine-tune this network using a small amount of real data. Our experimental results show that…
