End-to-End Active Speaker Detection
Juan Leon Alcazar, Moritz Cordes, Chen Zhao, and Bernard Ghanem

TL;DR
This paper introduces an end-to-end trainable network for active speaker detection that jointly learns features and context, incorporating interleaved graph neural network blocks and a weakly-supervised strategy, achieving state-of-the-art results.
Contribution
The paper presents a novel end-to-end ASD framework with interleaved GNN blocks and a weakly-supervised training strategy based on audio annotations, improving performance over existing methods.
Findings
Achieved state-of-the-art ASD performance.
Demonstrated effectiveness of interleaved GNN blocks.
Showed viability of weakly-supervised training with audio annotations.
Abstract
Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached…
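To make the idea of splitting message passing by source of context more concrete, here is a minimal sketch, not the authors' implementation. It assumes the two sources are spatial (speakers visible at the same timestamp) and temporal (the same speaker across neighbouring timestamps), which is one reading of "spatio-temporal context" in the abstract. Plain mean aggregation stands in for learned GNN layers, every timestamp is assumed to contain the same number of speakers, and all names (spatial_step, temporal_step, interleaved_block) are hypothetical.

```python
from typing import List

# Node features indexed as [time][speaker][channel] (assumed layout).
Features = List[List[List[float]]]


def mean_vectors(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def spatial_step(x: Features) -> Features:
    """Pass messages among all speakers that share a timestamp.

    Each node receives the mean over its frame (itself included).
    """
    return [[mean_vectors(frame) for _ in frame] for frame in x]


def temporal_step(x: Features) -> Features:
    """Pass messages along time for each speaker (t-1, t, t+1, clipped)."""
    num_t = len(x)
    out = []
    for t in range(num_t):
        lo, hi = max(0, t - 1), min(num_t, t + 2)
        out.append([
            mean_vectors([x[k][s] for k in range(lo, hi)])
            for s in range(len(x[t]))
        ])
    return out


def interleaved_block(x: Features) -> Features:
    """One hypothetical interleaved block: spatial pass, then temporal pass."""
    return temporal_step(spatial_step(x))
```

Keeping the two passes separate means each one aggregates over a single edge type, in contrast to a single pass over one mixed spatio-temporal neighbourhood. In a trainable version, each step would carry its own learned weights.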
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Graph Neural Network
