ObamaNet: Photo-realistic lip-sync from text
Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Brebisson, Yoshua Bengio

TL;DR
ObamaNet is a fully trainable neural architecture that converts any text into synchronized audio and photo-realistic lip-sync videos without traditional graphics methods.
Contribution
It introduces a novel neural pipeline combining text-to-speech, keypoint generation, and video synthesis for realistic lip-sync video creation from text.
Findings
First fully trainable neural lip-sync system
Generates synchronized audio and video from text
Does not rely on traditional graphics techniques
Abstract
We present ObamaNet, the first architecture that generates both audio and synchronized photo-realistic lip-sync videos from any new text. Contrary to other published lip-sync approaches, ours is only composed of fully trainable neural modules and does not rely on any traditional computer graphics methods. More precisely, we use three main modules: a text-to-speech network based on Char2Wav, a time-delayed LSTM to generate mouth-keypoints synced to the audio, and a network based on Pix2Pix to generate the video frames conditioned on the keypoints.
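The time-delay idea behind the keypoint module can be illustrated with a small data-alignment sketch. It is not the authors' code: the function name `delayed_pairs`, the delay of 2 frames, and the one-dimensional toy features are all assumptions made for illustration, and the recurrent network itself is left out. The sketch only shows how shifting the targets lets a model see `delay` future audio frames before it has to emit the mouth keypoints for a given frame.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def delayed_pairs(
    audio: List[Frame], keypoints: List[Frame], delay: int
) -> List[Tuple[Frame, Frame]]:
    """Pair the audio frame at step t + delay with the keypoints of step t.

    A recurrent model fed the audio in order and trained on these pairs has
    consumed audio frames 0..t+delay by the time it predicts keypoints[t].
    (Hypothetical helper; the delay value is an assumption.)
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(audio) != len(keypoints):
        raise ValueError("audio and keypoints must have the same number of frames")
    # The last `delay` keypoint frames have no matching future audio frame
    # and are dropped.
    return [(audio[t + delay], keypoints[t]) for t in range(len(audio) - delay)]


# Toy sequences: five frames of 1-D "audio features" and 1-D "keypoints".
audio = [[0.0], [0.1], [0.2], [0.3], [0.4]]
keypoints = [[10.0], [11.0], [12.0], [13.0], [14.0]]

pairs = delayed_pairs(audio, keypoints, delay=2)
for audio_frame, target in pairs:
    print(audio_frame, "->", target)
```

With a delay of 0 the pairing reduces to ordinary frame-by-frame alignment. A larger delay gives the model more look-ahead, but it must then wait that many frames before producing each output.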
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Face recognition and analysis
Methods: Concatenated Skip Connection · PatchGAN · Batch Normalization · Convolution · Dropout · Pix2Pix · Sigmoid Activation · Tanh Activation
