An Efficient and Streaming Audio Visual Active Speaker Detection System

Arnav Kundu; Yanzi Jin; Mohammad Sekhavat; Max Horton; Danny Tormoen; Devang Naik

arXiv:2409.09018 · cs.CV · September 16, 2024

TL;DR

This paper introduces a real-time, streaming audio-visual active speaker detection system that reduces latency by limiting future context frames and reduces memory usage by limiting past context frames, while maintaining high detection performance.

Contribution

The paper proposes a novel constrained transformer-based approach for streaming ASD that effectively balances accuracy with real-time computational and memory constraints.
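One common way to constrain a transformer's context is a banded attention mask, in which frame t may attend only to frames between t - past and t + future. The sketch below illustrates that general idea under stated assumptions; it is not the authors' implementation, and the name `context_mask` and the parameters `past` and `future` are hypothetical.

```python
def context_mask(num_frames: int, past: int, future: int) -> list[list[bool]]:
    """Build a banded attention mask (illustrative assumption, not the paper's code).

    mask[t][s] is True when frame t is allowed to attend to frame s,
    that is, when t - past <= s <= t + future.
    """
    if num_frames < 0 or past < 0 or future < 0:
        raise ValueError("num_frames, past and future must be non-negative")
    return [
        [t - past <= s <= t + future for s in range(num_frames)]
        for t in range(num_frames)
    ]
```

With `future=0` the mask is causal, so a frame could be scored as soon as it arrives; a larger `future` would delay each decision by that many frames in exchange for more look-ahead.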

Findings

01. Constrained transformers match or outperform recurrent models in ASD accuracy.

02. Limiting past context has a greater impact on accuracy than limiting future context.

03. The architecture is memory-bound by past context size with negligible compute costs.
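To see why memory can scale with the past context size in a streaming setting, consider a fixed-length buffer of per-frame features. This is a minimal sketch assuming the system retains only the most recent `past` frames plus the current one; the class `PastContextBuffer` and its methods are hypothetical and do not come from the paper.

```python
from collections import deque


class PastContextBuffer:
    """Fixed-size window of frame features (illustrative assumption)."""

    def __init__(self, past: int) -> None:
        if past < 0:
            raise ValueError("past must be non-negative")
        # maxlen makes the deque discard the oldest frame automatically.
        self._frames: deque = deque(maxlen=past + 1)

    def push(self, feature: list[float]) -> list[list[float]]:
        """Add the newest frame and return the window currently retained."""
        self._frames.append(feature)
        return list(self._frames)

    def __len__(self) -> int:
        return len(self._frames)
```

However long the stream runs, such a buffer never holds more than `past + 1` frames, so its memory footprint would grow with `past` rather than with the length of the video.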

Abstract

This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real time whether a person is speaking in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. By doing so, we alleviate the need for processing the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a…

Taxonomy

Topics: Speech and Audio Processing · Blind Source Separation Techniques · Advanced Data Compression Techniques