How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild
Okan Köpüklü, Maja Taseska, Gerhard Rigoll

TL;DR
This paper introduces ASDNet, a three-stage architecture for audio-visual active speaker detection (audio-visual encoding, inter-speaker relation modeling, and temporal modeling) that sets a new state of the art on the AVA-ActiveSpeaker dataset and is accompanied by practical design guidelines.
Contribution
The paper proposes a novel three-stage architecture, ASDNet, with practical design guidelines, achieving state-of-the-art performance in active speaker detection.
Findings
Achieved 93.5% mAP on the AVA-ActiveSpeaker dataset.
Outperformed the second-best method by a margin of 4.7% mAP.
Provided practical guidelines, based on a series of controlled experiments, for designing audio-visual active speaker detection systems.
Abstract
Successful active speaker detection requires a three-stage pipeline: (i) audio-visual encoding for all speakers in the clip, (ii) inter-speaker relation modeling between a reference speaker and the background speakers within each frame, and (iii) temporal modeling for the reference speaker. Each stage of this pipeline plays an important role in the final performance of the resulting architecture. Based on a series of controlled experiments, this work presents several practical guidelines for audio-visual active speaker detection. Correspondingly, we present a new architecture called ASDNet, which achieves a new state of the art on the AVA-ActiveSpeaker dataset with a mAP of 93.5%, outperforming the second best by a large margin of 4.7%. Our code and pretrained models are publicly available.
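The data flow of the three stages can be illustrated with a minimal sketch. This is not ASDNet's implementation: every function name, the input layout, and the placeholder operations (concatenation, mean pooling, a moving average with a sigmoid) are assumptions chosen only to show how the stages hand data to one another. A real system would presumably use learned encoders and a learned temporal model in their place.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
# Hypothetical input layout: one dict per frame, mapping a speaker id
# to a pair of (audio features, face features).
Frame = Dict[str, Tuple[Vector, Vector]]


def encode_speaker(audio: Vector, face: Vector) -> Vector:
    """Stage (i) placeholder: fuse audio and visual features by concatenation."""
    return list(audio) + list(face)


def relate_speakers(reference: Vector, background: List[Vector]) -> Vector:
    """Stage (ii) placeholder: append the mean background embedding
    to the reference speaker's embedding for a single frame."""
    if background:
        mean_bg = [sum(col) / len(background) for col in zip(*background)]
    else:
        # No other speakers in the frame: pad with zeros.
        mean_bg = [0.0] * len(reference)
    return list(reference) + mean_bg


def temporal_scores(frames: List[Vector], window: int = 3) -> List[float]:
    """Stage (iii) placeholder: smooth a per-frame score over neighbouring
    frames and squash it into (0, 1) with a sigmoid."""
    raw = [sum(f) / len(f) for f in frames]
    half = window // 2
    scores = []
    for t in range(len(raw)):
        context = raw[max(0, t - half): t + half + 1]
        smoothed = sum(context) / len(context)
        scores.append(1.0 / (1.0 + math.exp(-smoothed)))
    return scores


def detect_active_speaker(clip: List[Frame], reference_id: str) -> List[float]:
    """Run the three stages for one reference speaker over a clip."""
    per_frame = []
    for frame in clip:
        # Stage (i): encode every speaker visible in the frame.
        embeddings = {sid: encode_speaker(a, v) for sid, (a, v) in frame.items()}
        # Stage (ii): relate the reference speaker to the background speakers.
        reference = embeddings[reference_id]
        background = [e for sid, e in embeddings.items() if sid != reference_id]
        per_frame.append(relate_speakers(reference, background))
    # Stage (iii): model the reference speaker over time.
    return temporal_scores(per_frame)


# Toy clip with two speakers and made-up one-dimensional features.
clip = [
    {"A": ([0.9], [0.8]), "B": ([0.1], [0.2])},
    {"A": ([0.7], [0.9]), "B": ([0.2], [0.1])},
    {"A": ([0.1], [0.2]), "B": ([0.8], [0.9])},
    {"A": ([0.0], [0.1]), "B": ([0.9], [0.7])},
]
scores = detect_active_speaker(clip, reference_id="A")
print(scores)  # one score per frame for speaker "A"
```

The structural point of the sketch is the ordering: every speaker in a frame is encoded before the reference speaker is compared with the others, and only the reference speaker's per-frame representation is passed on to the temporal stage.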
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
