Visually Supervised Speaker Detection and Localization via Microphone Array
Davide Berghi, Adrian Hilton, Philip J. B. Jackson

TL;DR
This paper presents a microphone-array-based audio CNN for active speaker detection and localization that regresses the speaker's horizontal position in the video frame directly from multichannel audio and outperforms the baseline methods it is compared against.
Contribution
It introduces an audio-visual pipeline in which a pre-trained active speaker detector, run on pre-extracted face tracks, acts as a visual teacher and supplies weak labels for training an audio network to localize the speaker.
Findings
Significantly outperforms baseline methods.
Achieves high accuracy in speaker localization.
Improves speech activity detection performance.
Abstract
Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audio-visual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates. Monaural audio may successfully detect the presence of speech activity but fails in localizing the speaker due to the lack of spatial cues. Our solution extends the audio front-end using a microphone array. We train an audio convolutional neural network (CNN) in combination with beamforming techniques to regress the speaker's horizontal position directly in the video frames. We propose to generate weak labels using a pre-trained active speaker detector on pre-extracted face tracks. Our pipeline…
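The weak-labelling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `FaceDetection` structure, the 0.5 score threshold, the choice of the single most confident face, and the normalization of the horizontal position to [0, 1] are all assumptions made for illustration. The idea shown is that a teacher's per-face speaking scores are turned into a per-frame target (speech activity plus horizontal position) that an audio network could be trained to regress.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FaceDetection:
    """One face crop in one video frame (hypothetical structure)."""
    x_min: float      # left edge of the bounding box, in pixels
    x_max: float      # right edge of the bounding box, in pixels
    asd_score: float  # teacher's speaking confidence, assumed in [0, 1]


def weak_label(
    faces: List[FaceDetection],
    frame_width: float,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Tuple[int, Optional[float]]:
    """Return (speech_activity, normalized_x) for a single frame.

    speech_activity is 1 if any face reaches the threshold, else 0.
    normalized_x is the horizontal centre of the most confident
    speaking face, scaled by the frame width; None if nobody speaks.
    """
    speaking = [f for f in faces if f.asd_score >= threshold]
    if not speaking:
        return 0, None
    best = max(speaking, key=lambda f: f.asd_score)
    centre = 0.5 * (best.x_min + best.x_max)
    return 1, centre / frame_width


# Two visible faces; only the first is judged to be speaking.
faces = [FaceDetection(100, 300, 0.9), FaceDetection(500, 700, 0.2)]
label = weak_label(faces, frame_width=1000.0)
```

Because the labels come from a detector rather than manual annotation, they are noisy and only cover visible faces, which is why they are described as weak.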
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
