Speech Prediction in Silent Videos using Variational Autoencoders
Ravindra Yadav, Ashish Sardana, Vinay P Namboodiri, Rajesh M Hegde

TL;DR
This paper introduces a stochastic variational autoencoder-based model that predicts speech from silent videos by capturing the multimodal distribution of audio-visual signals, improving over deterministic approaches.
Contribution
The paper proposes a novel combination of recurrent neural networks and variational deep generative models for speech prediction in silent videos, addressing the fact that the mapping from video to speech is one-to-many (multimodal) rather than deterministic.
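As a rough illustration of what a variational model conditioned on video optimizes, the generic conditional evidence lower bound can be written as below. This is the standard conditional VAE objective, not necessarily the exact loss used in the paper; the symbols $a$ (audio features), $v$ (video frames), $z$ (latent variable), $\theta$ and $\phi$ (decoder and encoder parameters) are assumptions introduced here for illustration.

```latex
\log p_\theta(a \mid v)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid a, v)}\!\left[ \log p_\theta(a \mid z, v) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid a, v) \,\|\, p_\theta(z \mid v) \right)
```

The first term rewards accurate reconstruction of the audio given the latent and the video; the second keeps the approximate posterior close to the video-conditioned prior, so that sampling different $z$ at test time can yield different plausible audio outputs for the same video.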
Findings
Effective speech prediction demonstrated on the GRID dataset.
Outperforms deterministic models by capturing the full data distribution rather than its average.
Shows potential for applications in accessibility and video editing.
Abstract
Understanding the relationship between auditory and visual signals is crucial for many applications, ranging from computer-generated imagery (CGI) and video editing automation to assisting people with hearing or visual impairments. However, this is challenging because the distributions of both the audio and visual modalities are inherently multimodal. As a result, most existing methods ignore the multimodal aspect and assume that only a deterministic one-to-one mapping exists between the two modalities. This assumption can lead to low-quality predictions, as the model collapses to optimizing the average behavior rather than learning the full data distribution. In this paper, we present a stochastic model for generating speech in a silent video. The proposed model combines recurrent neural networks and variational deep generative models to learn the auditory signal's conditional…
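The claim that a deterministic model "collapses to optimizing the average behavior" can be seen in a toy setting. The sketch below is not the paper's model or data: it assumes a hypothetical one-dimensional audio feature that, for the same visual input, is equally likely to fall near -1 or +1. A deterministic predictor trained with mean squared error converges to the mean of the targets, which lies between the two modes, while a sampler with a latent choice of mode produces values near real outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy targets: one fixed visual input, two equally likely audio outcomes.
MODES = np.array([-1.0, 1.0])
NOISE_STD = 0.05
targets = rng.choice(MODES, size=10_000) + NOISE_STD * rng.standard_normal(10_000)

# A deterministic predictor trained with MSE converges to the mean of the targets.
det_prediction = targets.mean()

def sample_prediction(rng):
    """Stochastic predictor: draw a latent mode, then add observation noise."""
    z = rng.choice(MODES)
    return z + NOISE_STD * rng.standard_normal()

def distance_to_nearest_mode(x):
    """Absolute distance from x to the closest of the two true modes."""
    return np.minimum(np.abs(x - MODES[0]), np.abs(x - MODES[1]))

samples = np.array([sample_prediction(rng) for _ in range(1_000)])

print("deterministic prediction:", det_prediction)
print("its distance to nearest mode:", distance_to_nearest_mode(det_prediction))
print("mean sample distance to nearest mode:", distance_to_nearest_mode(samples).mean())
```

The deterministic prediction lands close to 0, a value the data almost never takes, whereas the sampled predictions stay close to one mode or the other. A learned latent variable model aims for the second behavior without being told the modes in advance.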
