Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Sindhu B Hegde; K R Prajwal; Rudrabha Mukhopadhyay; Vinay P. Namboodiri; C. V. Jawahar

arXiv:2209.00642·cs.CV·September 2, 2022

TL;DR

This paper introduces a lip-to-speech synthesis method that generates speech from silent lip videos of any speaker in the wild, without restricting the set of speakers, using a VAE-GAN architecture to handle the stochastic variations in the target speech.

Contribution

The paper presents a speaker-agnostic lip-to-speech model that works on in-the-wild videos and can be fine-tuned for specific speakers, outperforming previous approaches.

Findings

01. Significantly outperforms all baseline methods.

02. Can be fine-tuned for specific speakers with limited data.

03. Performs well on multiple datasets, as shown by both quantitative and qualitative results.

Abstract

In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, with the key one being that many features of the desired target speech, like voice, pitch and linguistic content, cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in…
