End to End Lip Synchronization with a Temporal AutoEncoder
Yoav Shalev, Lior Wolf

TL;DR
This paper presents an end-to-end neural network approach to lip synchronization in videos, trained on synthetic data generated by dropping and duplicating video frames, that outperforms prior methods on existing and new benchmarks.
Contribution
Introduces a dual-domain recurrent neural network trained on synthetic data for accurate lip-audio synchronization in videos.
Findings
Outperforms existing methods on benchmark datasets
Effective alignment of text-to-speech audio with videos
Robust performance across various video types
Abstract
We study the problem of syncing the lip movement in a video with the audio stream. Our solution finds an optimal alignment using a dual-domain recurrent neural network that is trained on synthetic data we generate by dropping and duplicating video frames. Once the alignment is found, we modify the video in order to sync the two sources. Our method is shown to greatly outperform the literature methods on a variety of existing and new benchmarks. As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream. Our code and samples are available at https://github.com/itsyoavshalev/End-to-End-Lip-Synchronization-with-a-Temporal-AutoEncoder.
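The abstract mentions that training data is synthesized by dropping and duplicating video frames. A minimal sketch of that idea follows; it is not the authors' implementation. The function name `desync_frames`, the per-frame probabilities `drop_prob` and `dup_prob`, and the choice to return source-frame indices are all assumptions made for illustration.

```python
import random


def desync_frames(num_frames, drop_prob=0.1, dup_prob=0.1, seed=None):
    """Return source-frame indices for a synthetically desynced clip.

    Hypothetical helper: each original frame is dropped with probability
    drop_prob, emitted twice with probability dup_prob, and otherwise
    emitted once. The probabilities are illustrative, not from the paper.
    """
    rng = random.Random(seed)
    indices = []
    for i in range(num_frames):
        r = rng.random()  # uniform in [0, 1)
        if r < drop_prob:
            continue  # frame dropped
        indices.append(i)
        if r < drop_prob + dup_prob:
            indices.append(i)  # frame duplicated
    return indices


# The returned list doubles as the ground-truth alignment: position k of
# the distorted clip shows original frame indices[k].
distorted = desync_frames(20, drop_prob=0.2, dup_prob=0.2, seed=0)
```

Because the distortion is generated, the mapping from each distorted frame back to its original position is known exactly, which is what would let a model be supervised on alignment without manual labels.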
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
Methods: ALIGN
