Look Who's Talking: Active Speaker Detection in the Wild

You Jin Kim; Hee-Soo Heo; Soyeon Choe; Soo-Whan Chung; Yoohwan Kwon; Bong-Jin Lee; Youngki Kwon; Joon Son Chung

arXiv:2108.07640 · cs.CV · August 18, 2021 · 1 citation

TL;DR

This paper introduces the Active Speakers in the Wild (ASW) dataset, a new resource for evaluating active speaker detection in natural settings, and assesses baseline systems on this dataset.

Contribution

The paper presents a novel dataset for active speaker detection in natural environments and provides baseline evaluations for future research.

Findings

01. Baseline systems achieve moderate performance on ASW.

02. Dubbed videos negatively impact training effectiveness.

03. The dataset enables evaluation of active speaker detection in real-world scenarios.

Abstract

In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre-processing step for many audio-visual tasks, there is no existing dataset of natural human speech to evaluate the performance of active speaker detection. We therefore curate the Active Speakers in the Wild (ASW) dataset which contains videos and co-occurring speech segments with dense speech activity labels. Videos and timestamps of audible segments are parsed and adopted from VoxConverse, an existing speaker diarisation dataset that consists of videos in the wild. Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way. Two reference systems, a…
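The abstract's definition of an active speaker (face visible and voice audible at the same time) can be illustrated as an interval intersection. The following is a minimal sketch, not the authors' semi-automatic annotation pipeline: the function name `active_segments`, the `(start, end)` tuple format in seconds, and the example timestamps are all hypothetical. It also assumes that the audible intervals belong to the same person as the face track and that the intervals within each list do not overlap one another.

```python
from typing import List, Tuple

# Hypothetical representation: one (start_sec, end_sec) pair per interval.
Interval = Tuple[float, float]


def active_segments(face_visible: List[Interval],
                    voice_audible: List[Interval]) -> List[Interval]:
    """Return the spans where a face track and that person's speech overlap.

    Assumes the intervals within each input list are non-overlapping.
    """
    face = sorted(face_visible)
    voice = sorted(voice_audible)
    result: List[Interval] = []
    i = j = 0
    while i < len(face) and j < len(voice):
        # The overlap of two intervals runs from the later start
        # to the earlier end; it is empty if start >= end.
        start = max(face[i][0], voice[j][0])
        end = min(face[i][1], voice[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if face[i][1] < voice[j][1]:
            i += 1
        else:
            j += 1
    return result


# Illustrative timestamps only, not taken from ASW or VoxConverse.
face_track = [(0.0, 5.0), (8.0, 12.0)]
speech = [(3.0, 9.0), (11.0, 15.0)]
print(active_segments(face_track, speech))
```

In this sketch, time spans where the face is visible but the person is silent, or where speech is audible but the face is off screen, fall outside the returned segments and would count as inactive under the definition above.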

Code & Models

Repositories

clovaai/lookwhostalking (Official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis