Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud

arXiv:2009.11204 · cs.CV · October 19, 2020

TL;DR

This paper introduces two deep learning models for visual voice activity detection using facial landmarks and optical flow, and presents a novel method to automatically generate large, annotated in-the-wild datasets to improve training and evaluation.

Contribution

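The paper proposes two new deep architectures for V-VAD, one based on facial landmarks and one based on optical flow, and a novel method for automatically creating and annotating in-the-wild datasets. The resulting dataset, WildVVAD, adds the content variability that existing V-VAD datasets lack.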

Findings

1. Models trained on WildVVAD outperform those trained on existing datasets.

2. Automatic dataset annotation improves V-VAD performance.

3. Deep architectures effectively leverage visual cues for voice activity detection.

Abstract

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
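
The abstract describes the annotation idea only at a high level: A-VAD decisions are combined with face detection and tracking. A minimal Python sketch of how such a combination could produce labeled segments is given below. It is not the authors' pipeline. The function names (`frame_label`, `label_segments`), the single-face rule, and the `min_len` threshold are all assumptions made for illustration, and the per-frame inputs stand in for the outputs of whatever A-VAD and face tracker are actually used.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def frame_label(speech: bool, n_faces: int) -> Optional[str]:
    """Label one video frame, or return None if it is ambiguous.

    Assumption (not from the paper): a frame is only usable when exactly
    one face is tracked, so the audio can be attributed to that face.
    """
    if n_faces != 1:
        return None
    return "speaking" if speech else "silent"


def label_segments(
    speech_flags: List[bool],
    face_counts: List[int],
    min_len: int = 3,
) -> List[Tuple[int, int, str]]:
    """Group consecutive frames with the same label into segments.

    speech_flags[i] is a hypothetical per-frame A-VAD decision and
    face_counts[i] the number of faces tracked in frame i.
    Returns (start, end, label) tuples with `end` exclusive; runs shorter
    than `min_len` frames (a hypothetical threshold) are discarded.
    """
    if len(speech_flags) != len(face_counts):
        raise ValueError("speech_flags and face_counts must have equal length")

    labels = [frame_label(s, n) for s, n in zip(speech_flags, face_counts)]
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label is not None and length >= min_len:
            segments.append((start, start + length, label))
        start += length
    return segments


if __name__ == "__main__":
    # Four speech frames and three silent frames with one face,
    # then two speech frames with two faces (ambiguous, so dropped).
    speech = [True] * 4 + [False] * 3 + [True] * 2
    faces = [1] * 7 + [2] * 2
    print(label_segments(speech, faces))
```

In this toy run the first seven frames yield one "speaking" and one "silent" segment, while the last two frames are discarded because two faces are visible and the audio cannot be attributed to either one.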

Taxonomy

Topics: Speech and Audio Processing · Video Surveillance and Tracking Methods · Face recognition and analysis