Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization

Mao-Kui He; Jun Du; Shu-Tong Niu; Qing-Feng Liu; Chin-Hui Lee

arXiv:2410.22350 · cs.MM · October 31, 2024

Open Access

TL;DR

This paper introduces a comprehensive audio-visual speaker diarization system that effectively handles overlapping speech and signal degradations using a multi-modal, quality-aware, end-to-end neural framework with cross attention mechanisms.

Contribution

It presents a novel end-to-end audio-visual diarization model that incorporates quality-aware fusion and cross attention to improve robustness in challenging conditions.

Findings

01. Achieves high accuracy even with degraded video quality.

02. Effectively handles overlapping speech scenarios.

03. Demonstrates robustness across diverse acoustic environments.

Abstract

In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework, which comprises three key techniques. First, our audio-visual model takes both audio and visual features as inputs, utilizing a series of binary classification output layers to simultaneously identify the activities of all speakers. This end-to-end framework is meticulously designed to effectively handle situations of overlapping speech, providing accurate discrimination between speech and non-speech segments through the utilization of multi-modal information. Next, we employ a quality-aware audio-visual fusion structure to address signal quality issues for both audio degradations, such as noise, reverberation and other distortions, and video degradations, such as occlusions, off-screen speakers, or unreliable detection. Finally, a cross attention mechanism applied to multi-speaker…
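
The abstract describes binary classification output layers that identify the activities of all speakers at once. A minimal sketch of how such per-speaker outputs could be decoded is given here; it is not the authors' implementation. The function names (`decode_activities`, `to_segments`), the 0.5 threshold, and the example logits are all assumptions made for illustration. The point it shows is that independent per-speaker decisions let two speakers be active in the same frame, which is how overlapping speech is represented.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_activities(logits, threshold=0.5):
    """Turn per-frame, per-speaker logits into sets of active speakers.

    logits: list of frames, each a list with one logit per speaker.
    threshold: hypothetical decision threshold (assumed, not from the paper).
    Each speaker is decided independently, so several can be active at once.
    """
    return [
        {spk for spk, z in enumerate(frame) if sigmoid(z) >= threshold}
        for frame in logits
    ]


def to_segments(active, num_speakers):
    """Merge consecutive active frames into (start, end) ranges per speaker.

    Frame indices are used as time units; `end` is exclusive.
    """
    segments = {spk: [] for spk in range(num_speakers)}
    for spk in range(num_speakers):
        start = None
        for t, frame in enumerate(active):
            if spk in frame and start is None:
                start = t  # speaker becomes active
            elif spk not in frame and start is not None:
                segments[spk].append((start, t))  # speaker becomes silent
                start = None
        if start is not None:
            segments[spk].append((start, len(active)))  # active until the end
    return segments


# Made-up logits for 4 frames and 2 speakers (illustrative values only).
logits = [
    [2.0, -3.0],   # only speaker 0
    [1.5, 0.8],    # both speakers: overlapping speech
    [-2.0, 2.2],   # only speaker 1
    [-4.0, -4.0],  # silence
]
active = decode_activities(logits)
segments = to_segments(active, num_speakers=2)
print(active)
print(segments)
```

In this toy input, frame 1 is assigned to both speakers, something a single softmax over speakers could not express. The quality-aware fusion and cross attention components mentioned in the abstract are not modeled here.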

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing