Spot the conversation: speaker diarisation in the wild

Joon Son Chung; Jaesung Huh; Arsha Nagrani; Triantafyllos Afouras; Andrew Zisserman

arXiv:2007.01216·cs.SD·August 17, 2021

TL;DR

This paper introduces an automatic audio-visual speaker diarisation method for YouTube videos, a semi-automatic annotation pipeline, and a large-scale 'in the wild' diarisation dataset called VoxConverse, facilitating research in real-world conditions.

Contribution

The paper presents a novel audio-visual diarisation approach, a semi-automatic annotation pipeline, and the creation of the VoxConverse dataset for 'in the wild' videos.

Findings

01. Effective active speaker detection in diverse conditions
02. Significant reduction in annotation effort
03. Large, challenging diarisation dataset released

Abstract

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.
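
The abstract names two stages, audio-visual active speaker detection and speaker verification with self-enrolled speaker models, without giving implementation details. The sketch below is one possible reading of how the two stages could fit together, not the authors' implementation. It assumes each speech segment already has a voice embedding and, where the visual stage was confident, a speaker label. The field names `embedding` and `av_speaker`, the cosine scoring, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enroll_speakers(segments):
    """Build one model per speaker by averaging the embeddings of segments
    that the (assumed) audio-visual stage has already labelled."""
    sums, counts = {}, {}
    for seg in segments:
        spk = seg["av_speaker"]  # hypothetical field: None if no visible active speaker
        if spk is None:
            continue
        emb = seg["embedding"]   # hypothetical field: voice embedding of the segment
        if spk not in sums:
            sums[spk] = [0.0] * len(emb)
            counts[spk] = 0
        sums[spk] = [s + e for s, e in zip(sums[spk], emb)]
        counts[spk] += 1
    return {spk: [s / counts[spk] for s in total] for spk, total in sums.items()}


def assign_speakers(segments, threshold=0.5):
    """Keep audio-visual labels where they exist; label the remaining segments
    with the best-matching enrolled model, or None if no score exceeds the
    (illustrative) threshold."""
    models = enroll_speakers(segments)
    labels = []
    for seg in segments:
        if seg["av_speaker"] is not None:
            labels.append(seg["av_speaker"])
            continue
        best, best_score = None, threshold
        for spk, model in models.items():
            score = cosine(seg["embedding"], model)
            if score > best_score:
                best, best_score = spk, score
        labels.append(best)
    return labels
```

With toy two-dimensional embeddings, segments that the visual stage could not label are given the speaker whose enrolled model they most resemble, and a segment that resembles nobody stays unlabelled. Real embeddings would come from a speaker verification network and the threshold would need tuning on held-out data.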
