Late Audio-Visual Fusion for In-The-Wild Speaker Diarization

Zexu Pan; Gordon Wichern; François G. Germain; Aswin Subramanian; Jonathan Le Roux

arXiv:2211.01299 · eess.AS · September 28, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a late-fusion audio-visual speaker diarization system for challenging in-the-wild videos. It combines an improved audio-only module with a visual-centric module and outperforms existing methods on the AVA-AVD benchmark.

Contribution

The paper proposes a late-fusion approach that integrates an enhanced audio diarization system (EEND-EDA++) with a visual-centric module for in-the-wild videos, achieving state-of-the-art results on AVA-AVD.

Findings

01 Surpasses state-of-the-art on AVA-AVD benchmark

02 Improved audio diarization with EEND-EDA++ using attention and speaker recognition loss

03 Effective visual module leveraging facial attributes and lip-audio synchrony

Abstract

Speaker diarization is well studied for constrained audio but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin,…
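To make the idea of late fusion concrete, here is a minimal sketch of one generic way to combine the outputs of two sub-systems at the decision level. It is not the fusion rule used in the paper, which this page does not describe. The function name `late_fuse`, the `visual_weight` and `threshold` parameters, and the rule of falling back to audio when the speaker is off-screen are all illustrative assumptions. The sketch also assumes the speaker has already been matched across the two streams.

```python
def late_fuse(audio_probs, visual_probs, on_screen,
              visual_weight=0.5, threshold=0.5):
    """Fuse per-frame speech-activity scores for ONE speaker.

    Hypothetical illustration only (not the paper's method).
    audio_probs  : per-frame speech probability from an audio-only system
    visual_probs : per-frame speech probability from a visual system
    on_screen    : per-frame flag, True if the speaker's face is visible
    Returns a list of booleans: True where the speaker is judged active.
    """
    fused = []
    for a, v, visible in zip(audio_probs, visual_probs, on_screen):
        if visible:
            # Both modalities are available: take a weighted average.
            score = (1.0 - visual_weight) * a + visual_weight * v
        else:
            # Face not visible: the visual score is unreliable, use audio only.
            score = a
        fused.append(score >= threshold)
    return fused


if __name__ == "__main__":
    audio = [0.9, 0.2, 0.8, 0.1]
    visual = [0.7, 0.9, 0.0, 0.0]
    visible = [True, True, False, False]
    print(late_fuse(audio, visual, visible))
```

The off-screen fallback reflects the difficulty noted in the abstract: in-the-wild videos have inconsistent on-screen speakers, so a visual-only decision is not always available.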

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation