AVA-AVD: Audio-Visual Speaker Diarization in the Wild

Eric Zhongcong Xu; Zeyang Song; Satoshi Tsutsui; Chao Feng; Mang Ye; Mike Zheng Shou

arXiv:2111.14448 · cs.CV · July 19, 2022

TL;DR

This paper introduces the AVA-AVD dataset for in-the-wild audio-visual speaker diarization, along with a new model called AVR-Net that improves robustness and accuracy in challenging scenarios.

Contribution

The paper creates the first in-the-wild audio-visual diarization dataset and proposes AVR-Net, a novel model with a modality mask for better speaker discrimination.

Findings

01

Adding AVA-AVD to the training set improves diarization performance on in-the-wild videos, even though the dataset is relatively small.

02

AVR-Net outperforms state-of-the-art methods.

03

The model is more robust to off-screen speakers.

Abstract

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments such as meeting rooms or news studios, which differ considerably from in-the-wild videos such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD to the training set produces significantly better diarization models for in-the-wild videos, even though the dataset is relatively small. Moreover, this benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing these challenges, we design the Audio-Visual Relation Network (AVR-Net)…
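
To make the "who spoke when" task concrete, here is a minimal sketch of what a diarization output might look like and how it can be queried. It is an illustration only: the `Segment` type, the speaker labels, and the helper functions are hypothetical and are not taken from the paper, the AVA-AVD annotations, or AVR-Net.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """One diarization segment (hypothetical format, not the dataset's)."""
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    speaker: str   # illustrative speaker label


def speaking_time(segments):
    """Sum the duration of every segment, grouped by speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def speakers_at(segments, t):
    """Return the sorted labels of all speakers active at time t (seconds)."""
    return sorted({s.speaker for s in segments if s.start <= t < s.end})


# Illustrative output with overlapping speech between 2.0 s and 2.5 s.
example = [
    Segment(0.0, 2.5, "spk_A"),
    Segment(2.0, 4.0, "spk_B"),
    Segment(4.0, 5.5, "spk_A"),
]

print(speaking_time(example))
print(speakers_at(example, 2.25))
```

Segments may overlap in time, which is why `speakers_at` returns a list rather than a single label. A speaker who is never visible on screen would still appear in such an output, since the labels describe who is talking and not who is in frame.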
