Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Hao Jiang; Calvin Murdock; Vamsi Krishna Ithapu

arXiv:2201.01928 · cs.CV · January 7, 2022

TL;DR

This paper introduces a novel deep learning method for egocentric audio-visual active speaker localization that accurately detects and localizes speakers in complex, noisy environments using video and multi-channel audio.

Contribution

It presents an end-to-end deep learning approach capable of localizing speakers from all directions, including outside the camera view, and detecting the wearer's own voice activity, outperforming previous methods.

Findings

01. Superior localization accuracy in challenging conditions
02. Real-time processing capability
03. Robustness against noise and visual clutter

Abstract

Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem from a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is…

Taxonomy

Topics: Speech and Audio Processing · Indoor and Outdoor Localization Technologies · Advanced Adaptive Filtering Techniques