Audio Inputs for Active Speaker Detection and Localization via Microphone Array
Davide Berghi, Philip J. B. Jackson

TL;DR
This paper investigates how well spatial acoustic features extracted from multichannel microphone-array audio work as inputs to a CRNN for active speaker detection and localization, and how performance varies with the number of channels and with additive noise.
Contribution
It compares GCC-PHAT, SALSA, and a recently proposed beamforming method as input features, and evaluates their robustness to noise intensity and to array configuration (aperture and sampling density) for active speaker detection and localization.
Findings
GCC-PHAT and SALSA features improve localization accuracy.
Performance depends on the number of microphones and on the noise level.
Microphone array configuration impacts detection robustness.
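To make "array configuration" concrete, here is a minimal sketch of two ways to take subsets from a 16-channel recording. It assumes, purely for illustration, a uniform linear array with microphones indexed 0 to 15 from one end to the other; the paper's actual geometry and chosen subsets are not stated in this summary. The function names `sparse_subset` and `central_subset` are hypothetical.

```python
import numpy as np

def sparse_subset(num_mics, k):
    """Pick k microphones spread over the whole array.

    Keeps (approximately) the full aperture but lowers the sampling density.
    Assumes microphones are indexed in spatial order along a line.
    """
    return np.round(np.linspace(0, num_mics - 1, k)).astype(int)

def central_subset(num_mics, k):
    """Pick k adjacent microphones from the middle of the array.

    Keeps the original spacing (sampling density) but shrinks the aperture.
    """
    start = (num_mics - k) // 2
    return np.arange(start, start + k)

# Hypothetical 16-channel recording: shape (channels, samples).
audio = np.zeros((16, 48000))

wide = sparse_subset(16, 4)     # end-to-end span, wide spacing
narrow = central_subset(16, 4)  # adjacent microphones, short span

audio_wide = audio[wide]        # shape (4, 48000)
audio_narrow = audio[narrow]    # shape (4, 48000)
```

Both subsets have four channels, so any difference in downstream performance between them would reflect aperture and spacing rather than channel count.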
Abstract
This study considers the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array. We refer to this as active speaker detection and localization (ASDL). Our goal was to investigate the performance of spatial acoustic features extracted from the multichannel audio as the input of a convolutional recurrent neural network (CRNN), in relation to the number of channels employed and additive noise. To this end, experiments were conducted to compare the generalized cross-correlation with phase transform (GCC-PHAT), the spatial cue-augmented log-spectrogram (SALSA) features, and a recently-proposed beamforming method, evaluating their robustness to various noise intensities. The array aperture and sampling density were tested by taking subsets from the 16-microphone array. Results and tests of statistical significance…
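As a rough illustration of one of the features named above, the following is a minimal NumPy sketch of GCC-PHAT for a single microphone pair and a single frame. It is not the authors' implementation: the function name `gcc_phat`, the zero-padding choice, the small regularization constant, and the `max_tau` parameter are all assumptions made for this example. In a feature-extraction pipeline, something like this would typically be computed per frame and per microphone pair before being stacked as network input.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x using GCC-PHAT.

    Returns (tau, cc): the estimated delay in seconds (positive when y lags x)
    and the cross-correlation values for lags -max_shift..+max_shift.
    """
    n = len(x) + len(y)                 # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)

    cross = Y * np.conj(X)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12      # phase transform: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs, cc
```

Limiting `max_tau` to the largest physically possible delay for a given microphone spacing (spacing divided by the speed of sound) keeps the correlation vector short and excludes implausible peaks.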
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
