TL;DR
This paper introduces the AVA-AVD dataset for in-the-wild audio-visual speaker diarization, along with a new model called AVR-Net that improves robustness and accuracy in challenging scenarios.
Contribution
The paper creates the first in-the-wild audio-visual diarization dataset and proposes AVR-Net, a novel model with a modality mask for better speaker discrimination.
Findings
Adding AVA-AVD improves diarization performance in wild videos.
AVR-Net outperforms state-of-the-art methods.
The model is more robust to off-screen speakers.
Abstract
Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments such as meeting rooms or news studios, which differ considerably from in-the-wild videos such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD to the training set can produce significantly better diarization models for in-the-wild videos, even though the dataset is relatively small. Moreover, this benchmark is challenging due to the diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing the challenges, we design the Audio-Visual Relation Network (AVR-Net)…
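To make the "who spoke when" output concrete, here is a minimal sketch of how a diarization result might be represented and queried. It is an illustration only, not the paper's code or data format: the `Segment` type, the helper functions, and the `spk0`/`spk1` labels are hypothetical names chosen for this example.

```python
from typing import Dict, List, NamedTuple, Set


class Segment(NamedTuple):
    """One diarization segment (hypothetical format, not from the paper)."""
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # anonymous speaker label, e.g. "spk0"


def speakers_at(segments: List[Segment], t: float) -> Set[str]:
    """Return the labels of all speakers active at time t.

    Segments are treated as half-open intervals [start, end), and
    overlapping speech is allowed, so more than one label can be returned.
    """
    return {s.speaker for s in segments if s.start <= t < s.end}


def speaking_time(segments: List[Segment]) -> Dict[str, float]:
    """Sum the total speaking time per speaker label, in seconds."""
    totals: Dict[str, float] = {}
    for s in segments:
        totals[s.speaker] = totals.get(s.speaker, 0.0) + (s.end - s.start)
    return totals


# A toy result: spk0 and spk1 overlap between 2.0 s and 2.5 s.
example = [
    Segment(0.0, 2.5, "spk0"),
    Segment(2.0, 4.0, "spk1"),
    Segment(4.0, 5.5, "spk0"),
]
```

In this toy representation, an off-screen speaker would simply be a label whose segments have no matching visible face; handling that case is what the abstract describes as one of the benchmark's challenges.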
Code & Models
- 🤗 pyannote/speaker-diarization-3.1 (model) · 11.3M dl · ♡ 1709
- 🤗 pyannote/speaker-diarization-community-1 (model) · 2.0M dl · ♡ 267
- 🤗 pyannote/speaker-diarization-3.0 (model) · 303k dl · ♡ 214
- 🤗 G-Root/speaker-diarization-optimized (model) · 316 dl
- 🤗 collinbarnwell/pyannote-speaker-diarization-31 (model) · 96 dl · ♡ 5
- 🤗 eek/speaker-diarization (model) · 4 dl
- 🤗 tensorlake/speaker-diarization-3.1 (model) · 159 dl · ♡ 4
- 🤗 msobroza/speaker_dia_31 (model) · 2 dl
- 🤗 fatymatariq/speaker-diarization-3.1 (model) · 4.1k dl · ♡ 1
- 🤗 statsmaths/diarize (model) · 2 dl
