Visually Supervised Speaker Detection and Localization via Microphone Array

Davide Berghi; Adrian Hilton; Philip J. B. Jackson

arXiv:2203.03291·eess.AS·March 8, 2022

Open Access

TL;DR

This paper presents a microphone array-based audio CNN approach for active speaker detection and localization that outperforms visual-only methods by directly regressing speaker position from audio signals.

Contribution

The paper introduces an audio-visual pipeline in which a pre-trained visual teacher (an active speaker detector run on face tracks) supplies weak labels for training an audio network to localize the speaker.

Findings

01. Significantly outperforms baseline methods.

02. Achieves high accuracy in speaker localization.

03. Improves speech activity detection performance.

Abstract

Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audio-visual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates. Monaural audio may successfully detect the presence of speech activity but fails in localizing the speaker due to the lack of spatial cues. Our solution extends the audio front-end using a microphone array. We train an audio convolutional neural network (CNN) in combination with beamforming techniques to regress the speaker's horizontal position directly in the video frames. We propose to generate weak labels using a pre-trained active speaker detector on pre-extracted face tracks. Our pipeline…
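
The weak-labelling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `FaceDetection` structure, the `weak_label` function, the 0.5 threshold, and the choice to keep only the highest-scoring face are all assumptions made for illustration. The idea shown is only that a visual teacher's per-face speaking scores can be turned into a per-frame target (speech activity plus a normalized horizontal position) for an audio network to regress.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FaceDetection:
    """One face box in a video frame, scored by a visual teacher (hypothetical structure)."""
    x_min: float      # left edge of the face box, in pixels
    x_max: float      # right edge of the face box, in pixels
    asd_score: float  # teacher's speaking confidence, assumed to lie in [0, 1]


def weak_label(
    faces: List[FaceDetection],
    frame_width: float,
    threshold: float = 0.5,  # assumed decision threshold, not taken from the paper
) -> Tuple[int, Optional[float]]:
    """Return (speech_active, x_norm) for one frame.

    speech_active is 1 if any face reaches the threshold, else 0.
    x_norm is the horizontal centre of the highest-scoring speaking face,
    normalized to [0, 1] across the frame width, or None if nobody speaks.
    """
    speaking = [f for f in faces if f.asd_score >= threshold]
    if not speaking:
        return 0, None
    best = max(speaking, key=lambda f: f.asd_score)
    centre = 0.5 * (best.x_min + best.x_max)
    return 1, centre / frame_width


if __name__ == "__main__":
    frame = [FaceDetection(100, 200, 0.2), FaceDetection(600, 800, 0.9)]
    print(weak_label(frame, frame_width=1000))  # (1, 0.7)
    print(weak_label([], frame_width=1000))     # (0, None)
```

Labels produced this way are weak in the sense the abstract uses: they inherit the teacher's errors, and a frame whose speaker is off-screen or undetected receives a "no speech" label even when speech is audible.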

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis