Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C. V. Jawahar

TL;DR
This paper introduces a lip-to-speech synthesis method that generates speech from silent lip videos of any speaker in the wild, without restricting the number of speakers, the domain, or the vocabulary. It uses a VAE-GAN architecture to handle the stochastic variations in the target speech.
Contribution
It presents a flexible, speaker-agnostic lip-to-speech model that works on wild videos and can be fine-tuned for specific identities, outperforming previous approaches.
Findings
Outperforms all baseline methods significantly.
Can be fine-tuned for specific speakers with limited data.
Evaluated both qualitatively and quantitatively on multiple datasets.
Abstract
In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, with the key one being that many features of the desired target speech, like voice, pitch and linguistic content, cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in…
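As a rough illustration of what a VAE-GAN objective generally looks like, the sketch below combines a reconstruction term, a KL term on a Gaussian latent, and an adversarial term from a discriminator. It is a generic, minimal sketch and not the authors' implementation: the mel-spectrogram targets, the L1 reconstruction loss, the non-saturating adversarial loss, and the weights `kl_weight` and `adv_weight` are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
    return float(np.mean(kl))


def generator_adversarial_loss(disc_logits_fake):
    """Non-saturating GAN loss for the generator: mean of -log(sigmoid(logit))."""
    # -log(sigmoid(x)) == softplus(-x) == logaddexp(0, -x), which is numerically stable.
    return float(np.mean(np.logaddexp(0.0, -disc_logits_fake)))


def vae_gan_generator_loss(target_mel, predicted_mel, mu, log_var, disc_logits_fake,
                           kl_weight=1e-2, adv_weight=1e-1):
    """Hypothetical combined objective: L1 reconstruction + weighted KL + weighted adversarial.

    The weights are illustrative placeholders, not values from the paper.
    """
    recon = float(np.mean(np.abs(target_mel - predicted_mel)))
    kl = kl_to_standard_normal(mu, log_var)
    adv = generator_adversarial_loss(disc_logits_fake)
    return recon + kl_weight * kl + adv_weight * adv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, latent_dim, frames, mel_bins = 4, 16, 20, 80  # assumed toy sizes
    mu = rng.standard_normal((batch, latent_dim))
    log_var = rng.standard_normal((batch, latent_dim))
    z = reparameterize(mu, log_var, rng)  # latent that would condition a speech decoder
    target = rng.standard_normal((batch, frames, mel_bins))
    predicted = rng.standard_normal((batch, frames, mel_bins))
    logits = rng.standard_normal((batch,))  # stand-in for discriminator outputs
    print(z.shape, vae_gan_generator_loss(target, predicted, mu, log_var, logits))
```

In a setup like this, the sampled latent is what would let a model produce different plausible voices or pitches for the same lip movements, while the adversarial term pushes the output toward realistic speech. With several discriminators, as the abstract mentions, their adversarial terms would typically be summed with separate weights.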
