Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud

TL;DR
This paper introduces two deep learning models for visual voice activity detection using facial landmarks and optical flow, and presents a novel method to automatically generate large, annotated in-the-wild datasets to improve training and evaluation.
Contribution
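The paper proposes two new deep architectures for V-VAD, one based on facial landmarks and one based on optical flow, and a novel method for automatically creating and annotating large in-the-wild datasets, which is used to build WildVVAD and increase the content variability of the training data.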
Findings
Models trained on WildVVAD outperform those trained on existing datasets.
Automatically annotating in-the-wild video by combining A-VAD with face detection and tracking yields training data that improves V-VAD performance.
Deep architectures effectively leverage visual cues for voice activity detection.
Abstract
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
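The annotation idea in the abstract, combining A-VAD with face detection and tracking, can be illustrated with a minimal sketch. It assumes that an A-VAD has already produced non-overlapping speech intervals and that a face tracker has produced the time span of one face track; the track is then labeled by how much of it overlaps detected speech. The function names, the label strings, and the 0.9 / 0.1 thresholds are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# A time interval in seconds: (start, end). Hypothetical representation.
Segment = Tuple[float, float]


def speech_fraction(track: Segment, speech: List[Segment]) -> float:
    """Fraction of a face track's duration covered by A-VAD speech intervals.

    Assumes the speech intervals do not overlap each other.
    """
    start, end = track
    duration = end - start
    if duration <= 0:
        raise ValueError("track must have positive duration")
    covered = 0.0
    for seg_start, seg_end in speech:
        # Length of the intersection between the track and this speech interval.
        covered += max(0.0, min(end, seg_end) - max(start, seg_start))
    return covered / duration


def label_track(
    track: Segment,
    speech: List[Segment],
    speaking_min: float = 0.9,  # assumed threshold, not from the paper
    silent_max: float = 0.1,    # assumed threshold, not from the paper
) -> Optional[str]:
    """Return 'speaking', 'not_speaking', or None for ambiguous tracks."""
    frac = speech_fraction(track, speech)
    if frac >= speaking_min:
        return "speaking"
    if frac <= silent_max:
        return "not_speaking"
    return None  # mixed audio evidence: discard rather than risk a wrong label
```

Discarding ambiguous tracks trades dataset size for label quality, which seems a reasonable default when the source video is plentiful; the paper's actual selection rules may differ.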
Taxonomy
Topics: Speech and Audio Processing · Video Surveillance and Tracking Methods · Face recognition and analysis
