How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild

Okan Köpüklü; Maja Taseska; Gerhard Rigoll

arXiv:2106.03932·cs.CV·September 8, 2021

TL;DR

This paper introduces ASDNet, a three-stage architecture for audio-visual active speaker detection (audio-visual encoding, inter-speaker relation modeling, and temporal modeling) that reaches 93.5% mAP on the AVA-ActiveSpeaker dataset, 4.7% above the second-best method.

Contribution

The paper proposes a novel three-stage architecture, ASDNet, with practical design guidelines, achieving state-of-the-art performance in active speaker detection.

Findings

01 Achieved 93.5% mAP on the AVA-ActiveSpeaker dataset.

02 Outperformed the second-best method by a margin of 4.7% mAP.

03 Provided practical guidelines, derived from controlled experiments, for designing audio-visual active speaker detection systems.

Abstract

Successful active speaker detection requires a three-stage pipeline: (i) audio-visual encoding for all speakers in the clip, (ii) inter-speaker relation modeling between a reference speaker and the background speakers within each frame, and (iii) temporal modeling for the reference speaker. Each stage of this pipeline plays an important role in the final performance of the resulting architecture. Based on a series of controlled experiments, this work presents several practical guidelines for audio-visual active speaker detection. Correspondingly, we present a new architecture called ASDNet, which achieves a new state of the art on the AVA-ActiveSpeaker dataset with an mAP of 93.5%, outperforming the second-best method by a large margin of 4.7%. Our code and pretrained models are publicly available.
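
The data flow of the three stages can be sketched in a few lines of plain Python. This is only an illustrative sketch and not the ASDNet implementation: all function names are hypothetical, and each stage uses a trivial placeholder (concatenation, mean pooling over background speakers, a moving average over frames) where the paper uses learned networks. The input layout, a list of frames that each map a speaker id to a pair of face and audio feature vectors, is likewise an assumption made for the example.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Frame = Dict[str, Tuple[Vector, Vector]]  # speaker id -> (face features, audio features)


def encode_audio_visual(face_feat: Vector, audio_feat: Vector) -> Vector:
    # Stage (i), placeholder: fuse one speaker's modalities by concatenation.
    return face_feat + audio_feat


def mean_vector(vectors: List[Vector], dim: int) -> Vector:
    # Element-wise mean; returns zeros when there are no background speakers.
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def relate_speakers(reference: Vector, background: List[Vector]) -> Vector:
    # Stage (ii), placeholder: pair the reference embedding with a pooled
    # summary of the background speakers in the same frame.
    return reference + mean_vector(background, len(reference))


def temporal_model(frame_feats: List[Vector], window: int = 3) -> List[float]:
    # Stage (iii), placeholder: reduce each frame to a scalar, then smooth
    # the sequence with a centered moving average.
    raw = [sum(f) / len(f) for f in frame_feats]
    half = window // 2
    smoothed = []
    for t in range(len(raw)):
        lo, hi = max(0, t - half), min(len(raw), t + half + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    return smoothed


def run_pipeline(clip: List[Frame], reference_id: str) -> List[float]:
    # Returns one score per frame for the reference speaker.
    per_frame = []
    for frame in clip:
        encoded = {sid: encode_audio_visual(f, a) for sid, (f, a) in frame.items()}
        background = [e for sid, e in encoded.items() if sid != reference_id]
        per_frame.append(relate_speakers(encoded[reference_id], background))
    return temporal_model(per_frame)


# Toy clip with two speakers ("A" is the reference) over three frames.
clip = [
    {"A": ([1.0, 0.0], [1.0]), "B": ([0.0, 0.0], [1.0])},
    {"A": ([1.0, 1.0], [1.0]), "B": ([0.0, 0.0], [1.0])},
    {"A": ([0.0, 0.0], [0.0]), "B": ([0.0, 1.0], [0.0])},
]
scores = run_pipeline(clip, "A")
```

The point of the sketch is the ordering: every speaker is encoded first, the reference speaker is then compared against the others frame by frame, and only the reference speaker's sequence is modeled over time. In a real system each placeholder would be replaced by a trained model.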

Code & Models

Repositories

okankop/ASDNet
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing