A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism
Ilya Gurvich, Ido Leichter, Dharmendar Reddy Palle, Yossi Asher, Alon Vinnikov, Igor Abramovski, Vishak Gopal, Ross Cutler, Eyal Krupka

TL;DR
This paper presents a real-time, low-power neural network system for active speaker detection that integrates audio-visual data and a spatial querying mechanism, suitable for edge devices and complex meeting scenarios.
Contribution
It introduces a novel neural network that learns to query acoustic data considering head locations, with graceful degradation under limited computational budgets, and is optimized for low-power edge deployment.
Findings
Operates with 127 MFLOPs per participant in a 14-person meeting
Exhibits graceful degradation when computational budget is exhausted
Performs well in realistic, challenging meeting scenarios
Abstract
We introduce a distinctive real-time, causal, neural network-based active speaker detection system optimized for low-power edge computing. This system drives a virtual cinematography module and is deployed on a commercial device. The system uses data originating from a microphone array and a 360-degree camera. Our network requires only 127 MFLOPs per participant, for a meeting with 14 participants. Unlike previous work, we examine the error rate of our network when the computational budget is exhausted, and find that it exhibits graceful degradation, allowing the system to operate reasonably well even in this case. Departing from conventional DOA estimation approaches, our network learns to query the available acoustic data, considering the detected head locations. We train and evaluate our algorithm on a realistic meetings dataset featuring up to 14 participants in the same meeting,…
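As a rough illustration of the compute figure quoted above, the per-participant cost can be multiplied out to an aggregate for the 14-participant case. This is a minimal sketch, not the authors' accounting: the abstract does not say whether 127 MFLOPs is per frame, per inference, or per second, and the per-participant figure is quoted only for a 14-person meeting, so it may not carry over to other meeting sizes. The names `MFLOPS_PER_PARTICIPANT`, `PARTICIPANTS`, and `total_mflops` are hypothetical.

```python
# Figures quoted in the abstract; the time unit (per frame, per inference, ...)
# is not stated there, so the result carries the same unspecified unit.
MFLOPS_PER_PARTICIPANT = 127
PARTICIPANTS = 14


def total_mflops(per_participant: int, participants: int) -> int:
    """Aggregate cost, assuming it equals the per-participant figure
    multiplied by the number of participants (an assumption, not a
    statement from the paper)."""
    return per_participant * participants


total = total_mflops(MFLOPS_PER_PARTICIPANT, PARTICIPANTS)
print(f"{total} MFLOPs (about {total / 1000:.2f} GFLOPs)")
# prints "1778 MFLOPs (about 1.78 GFLOPs)"
```

Under these assumptions, the whole 14-person meeting costs a little under 2 GFLOPs per evaluation.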
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
