Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos
Alexander Waibel, Moritz Behr, Fevziye Irem Eyiokur, Dogucan Yaman, Tuan-Nam Nguyen, Carlos Mullov, Mehmet Arif Demirtas, Alperen Kantarcı, Stefan Constantin, Hazım Kemal Ekenel

TL;DR
This paper introduces an end-to-end neural system that translates videos into different languages while maintaining lip synchronization and voice characteristics of the original speaker.
Contribution
It presents an integrated pipeline that combines speech recognition with emphasis detection, translation, emphasis-preserving text-to-speech, voice conversion, and GAN-based lip synchronization for realistic video translation.
Findings
System produces lip-synchronous, voice-preserving translated videos.
User study confirms realism and effectiveness of the translation.
Collected dataset supports future research in video translation.
Abstract
In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous translation of videos. The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains the emphases in speech, voice characteristics, and face video of the original speaker. The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model. The translated text is then synthesized by a Text-to-Speech model that recreates the original emphases mapped from the original sentence. The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a conditional generative adversarial network-based model…
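The stage order described in the abstract can be sketched as a simple chain of functions. This is a minimal illustration only, not the authors' implementation: the `Clip` container, every function name, and the placeholder values are assumptions made for this example, and each stage body is a stub standing in for a trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Clip:
    """State handed from stage to stage (all field names are hypothetical)."""
    transcript: str = ""
    emphasized_words: List[str] = field(default_factory=list)
    translation: str = ""
    log: List[str] = field(default_factory=list)


Stage = Callable[[Clip], Clip]


def recognize_speech(clip: Clip) -> Clip:
    # Stub for speech recognition with emphasis detection.
    clip.transcript = "hello WORLD"      # placeholder transcript
    clip.emphasized_words = ["WORLD"]    # placeholder emphasis labels
    clip.log.append("asr")
    return clip


def translate(clip: Clip) -> Clip:
    # Stub for the translation model.
    clip.translation = f"<translated: {clip.transcript}>"
    clip.log.append("translation")
    return clip


def synthesize_speech(clip: Clip) -> Clip:
    # Stub for text-to-speech that would re-create the mapped emphases.
    clip.log.append("tts")
    return clip


def convert_voice(clip: Clip) -> Clip:
    # Stub for mapping the synthetic voice back to the original speaker.
    clip.log.append("voice_conversion")
    return clip


def synchronize_lips(clip: Clip) -> Clip:
    # Stub for the conditional-GAN lip synchronization step.
    clip.log.append("lip_sync")
    return clip


# Stage order as listed in the abstract.
PIPELINE: Sequence[Stage] = (
    recognize_speech,
    translate,
    synthesize_speech,
    convert_voice,
    synchronize_lips,
)


def run_pipeline(clip: Clip, stages: Sequence[Stage] = PIPELINE) -> Clip:
    """Apply each stage in order, passing the clip state along."""
    for stage in stages:
        clip = stage(clip)
    return clip


result = run_pipeline(Clip())
print(result.log)
# ['asr', 'translation', 'tts', 'voice_conversion', 'lip_sync']
```

Keeping every stage behind the same `Clip -> Clip` signature is one way to make individual components swappable. Whether the actual system is organized this way is not stated in the abstract.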
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Test
