Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization
Ming Cheng, Ming Li

TL;DR
This paper introduces a multi-input multi-output framework for speaker diarization that integrates audio and visual data, achieving state-of-the-art results and robustness in overlapped speech and lip-missing scenarios.
Contribution
The novel MIMO-TSVAD framework enables flexible, unified audio-visual speaker diarization within a single sequence-to-sequence model. It addresses the limitations of each modality (overlapped speech for audio; occlusion and low resolution for video) and improves accuracy.
Findings
Achieves state-of-the-art diarization error rates (DERs) on the VoxConverse, DIHARD-III, and MISP 2022 datasets.
Performs robustly in heavy lip-missing scenarios.
Supports audio-only, video-only, and combined diarization modes.
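For readers unfamiliar with the metric behind these findings: the diarization error rate (DER) is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below is a simplified, frame-level illustration of that definition and is not the paper's evaluation code. It assumes per-speaker binary activity rows (the kind of output a TS-VAD-style model produces), assumes hypothesis speakers are already mapped to reference speakers, and applies no forgiveness collar. The function name `frame_level_der` is hypothetical.

```python
def frame_level_der(ref, hyp):
    """Simplified frame-level DER.

    ref, hyp: lists of rows, one row per speaker, one 0/1 entry per frame.
    Assumption: row i of hyp is already mapped to row i of ref, and no
    collar is applied around segment boundaries.
    """
    if len(ref) != len(hyp) or any(len(r) != len(h) for r, h in zip(ref, hyp)):
        raise ValueError("ref and hyp must have the same shape")

    missed = false_alarm = confusion = total_ref = 0
    # zip(*rows) yields one tuple per frame holding every speaker's label.
    for ref_frame, hyp_frame in zip(zip(*ref), zip(*hyp)):
        n_ref = sum(1 for r in ref_frame if r)
        n_hyp = sum(1 for h in hyp_frame if h)
        n_correct = sum(1 for r, h in zip(ref_frame, hyp_frame) if r and h)
        missed += max(n_ref - n_hyp, 0)             # reference speech not detected
        false_alarm += max(n_hyp - n_ref, 0)        # detected speech with no reference
        confusion += min(n_ref, n_hyp) - n_correct  # speech given to the wrong speaker
        total_ref += n_ref

    if total_ref == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_ref


# Two speakers, four frames; frame 3 is overlapped speech in the reference.
ref = [[1, 1, 1, 0],
       [0, 0, 1, 1]]
hyp = [[1, 1, 0, 0],
       [0, 0, 1, 0]]
print(frame_level_der(ref, hyp))  # 0.4 (2 missed speaker-frames out of 5)
```

Counting active speakers per frame means overlapped speech contributes more than once to the denominator, which is why overlap-heavy recordings are hard on this metric. Published benchmark scores are normally computed with established scoring tools that also handle speaker mapping and collars.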
Abstract
Audio-visual learning has demonstrated promising results in many classical speech tasks (e.g., speech separation, automatic speech recognition, wake-word spotting). We believe that introducing the visual modality will also benefit speaker diarization. To date, Target-Speaker Voice Activity Detection (TS-VAD) plays an important role in highly accurate speaker diarization. However, previous TS-VAD models take audio features and use each speaker's acoustic footprint to distinguish that speaker's speech activities, which makes them easily affected by overlapped speech in multi-speaker scenarios. Although visual information naturally tolerates overlapped speech, it suffers from spatial occlusion, low resolution, etc. This potential modality-missing problem hinders the extension of TS-VAD to an audio-visual approach. This paper proposes a novel Multi-Input Multi-Output Target-Speaker Voice Activity Detection…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
