Talking Face Generation with Multilingual TTS
Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, Kang-wook Kim

TL;DR
This paper introduces a system that generates realistic multilingual talking face videos from text, maintaining speaker identity and lip synchronization, with applications in translation and dubbing.
Contribution
It presents a novel joint system combining multilingual TTS and talking face generation, capable of producing synchronized videos in multiple languages from text input.
Findings
Generates talking face videos from text in four languages: Korean, English, Japanese, and Chinese.
Maintains speaker vocal identity and lip synchronization across languages.
Demonstrates generalization across language families: each of the four languages was chosen from a different family.
Abstract
In this work, we propose a joint system combining a talking face generation system with a text-to-speech system that can generate multilingual talking face videos from only the text input. Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker, as well as lip movements synchronized to the synthesized speech. We demonstrate the generalization capabilities of our system by selecting four languages (Korean, English, Japanese, and Chinese), each from a different language family. We also compare the outputs of our talking face generation model to the outputs of a prior work that claims multilingual support. For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber so that users can utilize the multilingual property of our system more easily.
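The abstract describes a three-stage flow: translation as preprocessing, then multilingual TTS, then talking face generation driven by the synthesized speech. A minimal sketch of how such stages might be chained is given below. Every name here (`dub`, `Speech`, `Video`, the `fake_*` stages) and the 25 fps frame rate are hypothetical placeholders chosen for illustration, not the authors' implementation or API; the stand-in stages only show the data flow.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Speech:
    samples: List[float]  # mono waveform (placeholder representation)
    sample_rate: int


@dataclass
class Video:
    frames: List[str]  # stand-in for image frames
    audio: Speech


def dub(
    text: str,
    target_lang: str,
    speaker_id: str,
    translate: Callable[[str, str], str],
    tts: Callable[[str, str, str], Speech],
    face_gen: Callable[[Speech, str, int], List[str]],
    fps: int = 25,  # assumed frame rate, not taken from the paper
) -> Video:
    """Chain the three stages: translate -> synthesize speech -> render face."""
    translated = translate(text, target_lang)
    speech = tts(translated, target_lang, speaker_id)
    frames = face_gen(speech, speaker_id, fps)
    return Video(frames=frames, audio=speech)


# Hypothetical stand-in stages so the sketch runs without any models.
def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}] {text}"


def fake_tts(text: str, lang: str, speaker_id: str) -> Speech:
    sample_rate = 16000
    return Speech(samples=[0.0] * (sample_rate * 2), sample_rate=sample_rate)  # 2 s of silence


def fake_face_gen(speech: Speech, speaker_id: str, fps: int) -> List[str]:
    # One frame per 1/fps seconds of audio keeps video length tied to speech length.
    duration = len(speech.samples) / speech.sample_rate
    return [f"{speaker_id}-frame-{i}" for i in range(round(duration * fps))]


video = dub("hello", "ko", "spk0", fake_translate, fake_tts, fake_face_gen)
```

Passing the stages in as callables keeps the sketch independent of any particular translation service, TTS model, or face generator; real components would replace the `fake_*` functions.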
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
