Late Audio-Visual Fusion for In-The-Wild Speaker Diarization
Zexu Pan, Gordon Wichern, François G. Germain, Aswin Subramanian, Jonathan Le Roux

TL;DR
This paper introduces a late fusion audio-visual speaker diarization system designed for challenging in-the-wild videos, combining improved audio and visual modules to outperform existing methods on benchmark datasets.
Contribution
It proposes a novel late fusion approach integrating an enhanced audio diarization system with a visual-centric module for in-the-wild videos, achieving state-of-the-art results.
Findings
Surpasses state-of-the-art on AVA-AVD benchmark
Improved audio diarization with EEND-EDA++ using attention and speaker recognition loss
Effective visual module leveraging facial attributes and lip-audio synchrony
Abstract
Speaker diarization is well studied for constrained audio but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin,…
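To make the idea of late fusion concrete, here is a minimal sketch of combining per-frame speech-activity scores from two sub-systems. The abstract does not specify the fusion rule, so everything here is an assumption for illustration: the function name `late_fuse`, the weighted-average rule, the `visual_weight` and `threshold` parameters, and the premise that speaker indices are already aligned across the two sub-systems. It is not the authors' method.

```python
from typing import List, Optional


def late_fuse(
    audio: List[List[float]],
    visual: List[List[Optional[float]]],
    visual_weight: float = 0.5,  # hypothetical fusion weight, not from the paper
    threshold: float = 0.5,      # hypothetical decision threshold
) -> List[List[bool]]:
    """Fuse per-frame, per-speaker speech-activity probabilities.

    audio[t][s]  : probability from the audio-only sub-system.
    visual[t][s] : probability from the visual sub-system, or None when
                   speaker s is off-screen in frame t.
    Assumes both inputs use the same speaker indexing (an assumption).
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual must cover the same frames")

    decisions = []
    for a_frame, v_frame in zip(audio, visual):
        if len(a_frame) != len(v_frame):
            raise ValueError("speaker count mismatch within a frame")
        row = []
        for a, v in zip(a_frame, v_frame):
            if v is None:
                # Off-screen speaker: fall back to the audio score alone.
                score = a
            else:
                # On-screen speaker: weighted average of the two scores.
                score = (1.0 - visual_weight) * a + visual_weight * v
            row.append(score >= threshold)
        decisions.append(row)
    return decisions


# Two frames, two speakers; None marks an off-screen speaker.
audio_scores = [[0.9, 0.2], [0.4, 0.3]]
visual_scores = [[None, 0.9], [0.8, None]]
fused = late_fuse(audio_scores, visual_scores)
```

The fallback to audio-only scores reflects the abstract's point that on-screen speakers are inconsistent in in-the-wild videos: a visual score is only usable for frames in which the speaker is visible.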
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
