End to End Lip Synchronization with a Temporal AutoEncoder

Yoav Shalev; Lior Wolf

arXiv:2203.16224 · cs.CV · March 31, 2022

Open Access · 1 Repo

TL;DR

This paper presents an end-to-end neural network for synchronizing lip movement with the audio stream of a video. The network is trained on synthetic data, and the authors report that it outperforms prior methods on several existing and new benchmarks.

Contribution

Introduces a dual-domain recurrent neural network that finds the alignment between lip movement and audio in a video. It is trained on synthetic data generated by dropping and duplicating video frames.

Findings

01 Outperforms existing methods on benchmark datasets

02 Effective alignment of text-to-speech audio with videos

03 Robust performance across various video types

Abstract

We study the problem of syncing the lip movement in a video with the audio stream. Our solution finds an optimal alignment using a dual-domain recurrent neural network that is trained on synthetic data we generate by dropping and duplicating video frames. Once the alignment is found, we modify the video in order to sync the two sources. Our method is shown to greatly outperform the literature methods on a variety of existing and new benchmarks. As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream. Our code and samples are available at https://github.com/itsyoavshalev/End-to-End-Lip-Synchronization-with-a-Temporal-AutoEncoder.
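The abstract says the training data is synthetic and is produced by dropping and duplicating video frames. The following is a minimal sketch of that idea only, not the authors' implementation. The function names, the per-frame probabilities `drop_prob` and `dup_prob`, and the choice of independent per-frame sampling are all assumptions made for illustration; the paper and repository may use a different scheme.

```python
import random


def make_misaligned_indices(num_frames, drop_prob=0.1, dup_prob=0.1, seed=None):
    """Return source-frame indices after random drops and duplications.

    Hypothetical helper: each frame is independently dropped with
    probability drop_prob or duplicated with probability dup_prob.
    The returned list doubles as a ground-truth record of which original
    frame each output frame came from.
    """
    rng = random.Random(seed)
    indices = []
    for i in range(num_frames):
        r = rng.random()
        if r < drop_prob:
            continue  # frame i is dropped
        indices.append(i)
        if r < drop_prob + dup_prob:
            indices.append(i)  # frame i is duplicated
    return indices


def apply_indices(frames, indices):
    """Build the misaligned clip by picking frames in the given order."""
    return [frames[i] for i in indices]


if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(10)]  # stand-in for image arrays
    idx = make_misaligned_indices(len(frames), seed=0)
    print(idx)
    print(apply_indices(frames, idx))
```

Because the audio is left untouched while the frame sequence is perturbed, the index list gives a supervision signal: a model that recovers it can map the perturbed video back onto the original timeline.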

Code & Models

Repositories

itsyoavshalev/end-to-end-lip-synchronization-with-a-temporal-autoencoder
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies

Methods: ALIGN