An Efficient and Streaming Audio Visual Active Speaker Detection System
Arnav Kundu, Yanzi Jin, Mohammad Sekhavat, Max Horton, Danny Tormoen, Devang Naik

TL;DR
This paper introduces a real-time, streaming audio-visual active speaker detection system that reduces latency and memory usage by limiting future and past context frames, achieving high performance with minimal resource demands.
Contribution
The paper proposes a novel constrained transformer-based approach for streaming ASD that effectively balances accuracy with real-time computational and memory constraints.
Findings
Constrained transformers match or outperform recurrent models in ASD accuracy.
Limiting past context has a greater impact on accuracy than limiting future context.
The architecture is memory-bound by past context size with negligible compute costs.
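To make the notion of constrained context concrete, here is a minimal sketch of a banded attention mask in which each frame may attend to only a fixed number of past and future frames. This illustrates the general idea and is not the authors' implementation; the function name `constrained_attention_mask` and the parameters `past` and `future` are hypothetical.

```python
def constrained_attention_mask(num_frames, past, future):
    """Return a num_frames x num_frames boolean mask.

    Entry [t][s] is True when frame t may attend to frame s, i.e. when s lies
    within `past` frames before t or `future` frames after t.
    All names here are illustrative assumptions, not taken from the paper.
    """
    return [
        [-past <= s - t <= future for s in range(num_frames)]
        for t in range(num_frames)
    ]


# Example: 6 frames, each seeing 2 past frames and 1 future frame.
mask = constrained_attention_mask(num_frames=6, past=2, future=1)
widest_row = max(sum(row) for row in mask)  # at most past + future + 1 frames
```

In such a scheme each row has at most `past + future + 1` allowed positions, so the per-frame context does not grow with the length of the video.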
Abstract
This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real-time whether a person is speaking or not in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. By doing so, we alleviate the need for processing the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a…
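The latency effect of limiting future context can be sketched as a streaming loop: a decision for a frame is emitted as soon as a fixed number of lookahead frames has arrived, rather than after the whole sequence. This is a hedged sketch only; `stream_decisions`, `score_window`, `past`, and `future` are hypothetical names, and the scoring function stands in for whatever model is used.

```python
from collections import deque


def stream_decisions(frames, past, future, score_window):
    """Yield (frame_index, score) once `future` lookahead frames are available.

    A bounded buffer keeps at most past + future + 1 frames. `score_window`
    is a placeholder callable (an assumption of this sketch) that maps the
    buffered frames to a speaking score. For brevity, the final `future`
    frames of the stream are not flushed.
    """
    window = deque(maxlen=past + future + 1)
    for i, frame in enumerate(frames):
        window.append(frame)
        target = i - future  # frame whose lookahead is now complete
        if target >= 0:
            yield target, score_window(list(window))


# Toy usage: integers stand in for frames, and the "model" just sums them.
decisions = list(stream_decisions(range(5), past=1, future=1, score_window=sum))
```

With this structure the decision for each frame is delayed by exactly `future` frames, and the buffer size is fixed by `past` and `future` regardless of stream length.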
Taxonomy
Topics: Speech and Audio Processing · Blind Source Separation Techniques · Advanced Data Compression Techniques
