End-to-End Active Speaker Detection

Juan Leon Alcazar, Moritz Cordes, Chen Zhao, and Bernard Ghanem

arXiv:2203.14250·cs.CV·July 26, 2022

TL;DR

This paper introduces an end-to-end trainable network for active speaker detection that jointly learns features and context, incorporating interleaved graph neural network blocks and a weakly-supervised strategy, achieving state-of-the-art results.

Contribution

It presents a novel end-to-end ASD framework with interleaved GNN blocks and a weakly-supervised approach using audio data, improving performance over existing methods.

Findings

01. Achieved state-of-the-art ASD performance.

02. Demonstrated effectiveness of interleaved GNN blocks.

03. Showed viability of weakly-supervised training with audio annotations.

Abstract

Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached…
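To make the idea of splitting message passing by context source more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the two sources of context are spatial (other speakers in the same frame) and temporal (the same speaker in adjacent frames), which the abstract suggests but does not spell out. The function names, the node indexing, the mean aggregation, the ReLU, and the residual connection are all hypothetical choices made for illustration.

```python
def build_edge_sets(num_frames, num_speakers):
    """Return (spatial, temporal) neighbour lists for a frames x speakers grid.

    Node index = frame * num_speakers + speaker (a convention assumed here).
    """
    n = num_frames * num_speakers
    spatial = [[] for _ in range(n)]
    temporal = [[] for _ in range(n)]
    for t in range(num_frames):
        for s in range(num_speakers):
            i = t * num_speakers + s
            # Spatial context (assumed): other speakers in the same frame.
            spatial[i] = [t * num_speakers + o
                          for o in range(num_speakers) if o != s]
            # Temporal context (assumed): same speaker in adjacent frames.
            temporal[i] = [u * num_speakers + s
                           for u in (t - 1, t + 1) if 0 <= u < num_frames]
    return spatial, temporal


def message_pass(features, neighbours, weight):
    """Mean-aggregate each node with its neighbours, then linear map + ReLU."""
    out = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in neighbours[i]]
        mean = [sum(col) / len(group) for col in zip(*group)]
        # weight is a d_in x d_out matrix stored as a list of rows.
        out.append([max(0.0, sum(m * w for m, w in zip(mean, col)))
                    for col in zip(*weight)])
    return out


def interleaved_block(features, spatial, temporal, w_spatial, w_temporal):
    """One hypothetical block: spatial step, then temporal step, plus residual."""
    h = message_pass(features, spatial, w_spatial)
    h = message_pass(h, temporal, w_temporal)
    return [[x + y for x, y in zip(f, g)] for f, g in zip(features, h)]


# Toy example: 2 frames x 2 speakers, 1-D features, identity weights.
spatial, temporal = build_edge_sets(num_frames=2, num_speakers=2)
features = [[1.0], [3.0], [5.0], [7.0]]
out = interleaved_block(features, spatial, temporal, [[1.0]], [[1.0]])
print(out)  # [[5.0], [7.0], [9.0], [11.0]]
```

In the toy run, the spatial step averages the two speakers within each frame (giving 2 and 6), the temporal step averages each speaker across the two frames (giving 4 everywhere), and the residual adds the result back to the inputs. Keeping the two edge sets separate means each step mixes information along only one kind of context at a time.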

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Graph Neural Network