A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

Ilya Gurvich; Ido Leichter; Dharmendar Reddy Palle; Yossi Asher; Alon Vinnikov; Igor Abramovski; Vishak Gopal; Ross Cutler; Eyal Krupka

arXiv:2309.08295 · eess.AS · September 18, 2023 · 1 citation

Open Access

TL;DR

This paper presents a real-time, low-power neural network system for active speaker detection that integrates audio-visual data and a spatial querying mechanism, suitable for edge devices and complex meeting scenarios.

Contribution

It introduces a novel neural network that learns to query acoustic data considering head locations, with graceful degradation under limited computational budgets, and is optimized for low-power edge deployment.

Findings

01. Operates with 127 MFLOPs per participant in a 14-person meeting

02. Exhibits graceful degradation when computational budget is exhausted

03. Performs well in realistic, challenging meeting scenarios

Abstract

We introduce a distinctive real-time, causal, neural network-based active speaker detection system optimized for low-power edge computing. This system drives a virtual cinematography module and is deployed on a commercial device. The system uses data originating from a microphone array and a 360-degree camera. Our network requires only 127 MFLOPs per participant, for a meeting with 14 participants. Unlike previous work, we examine the error rate of our network when the computational budget is exhausted, and find that it exhibits graceful degradation, allowing the system to operate reasonably well even in this case. Departing from conventional DOA estimation approaches, our network learns to query the available acoustic data, considering the detected head locations. We train and evaluate our algorithm on a realistic meetings dataset featuring up to 14 participants in the same meeting,…
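To put the compute figure in context, the short sketch below multiplies the per-participant cost quoted in the abstract by the quoted meeting size. Only the two numbers (127 MFLOPs and 14 participants) come from the abstract; the function name `total_mflops` is hypothetical, and the abstract does not say what unit of work the figure covers (for example, one inference step), so the result is labeled only as a total for that same unspecified unit. The per-participant cost is quoted specifically for a 14-person meeting, so this arithmetic should not be extrapolated to other meeting sizes.

```python
# Figures quoted in the abstract.
MFLOPS_PER_PARTICIPANT = 127   # quoted for a meeting with 14 participants
PARTICIPANTS = 14


def total_mflops(per_participant: float, participants: int) -> float:
    """Hypothetical helper: total cost across all participants.

    Assumes the per-participant figure applies uniformly to every
    participant in the meeting, which the abstract implies only for
    the 14-participant case.
    """
    return per_participant * participants


total = total_mflops(MFLOPS_PER_PARTICIPANT, PARTICIPANTS)
print(f"{total} MFLOPs (~{total / 1000:.2f} GFLOPs)")
# prints "1778 MFLOPs (~1.78 GFLOPs)"
```

Under these assumptions, the whole 14-person meeting costs roughly 1.78 GFLOPs for the same unit of work to which the abstract's per-participant figure refers.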

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis