The VVAD-LRS3 Dataset for Visual Voice Activity Detection
Adrian Lubitz, Matias Valdenegro-Toro, Frank Kirchner

TL;DR
This paper introduces the VVAD-LRS3 dataset, a large-scale, automatically annotated dataset for visual voice activity detection, and evaluates neural network baselines on it; the CNN LSTM baseline reaches 92% accuracy, compared with 87.93% for humans on the same task.
Contribution
The paper presents a new large-scale VVAD dataset derived from LRS3, enabling improved training of neural networks for VVAD tasks.
Findings
The dataset contains over 44,000 samples, more than three times the size of the next competitive dataset (WildVVAD).
CNN LSTM models achieved 92% accuracy on the dataset.
Humans achieved 87.93% accuracy on the same task.
Abstract
Robots are becoming everyday devices, increasing their interaction with humans. To make human-machine interaction more natural, cognitive features like Visual Voice Activity Detection (VVAD), which detects whether a person is speaking or not from the visual input of a camera, need to be implemented. Neural networks are state of the art for tasks in image processing, time series prediction, natural language processing and other domains. These networks require large quantities of labeled data. Currently there are not many datasets for the task of VVAD. In this work we created a large-scale dataset called the VVAD-LRS3 dataset, derived by automatic annotations from the LRS3 dataset. The VVAD-LRS3 dataset contains over 44K samples, over three times the size of the next competitive dataset (WildVVAD). We evaluate different baselines on four kinds of features: facial and lip images, and facial and lip…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Facial Nerve Paralysis Treatment and Research
Methods: Test
