MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset
Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh

TL;DR
This paper introduces MultiTalk, a model that generates 3D talking heads from speech in multiple languages. It is supported by a new multilingual video dataset and improves lip-sync accuracy across languages.
Contribution
The work presents a new multilingual dataset and a model that incorporates language-specific style embeddings to enhance 3D talking head generation across languages.
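As a rough illustration of what conditioning on a language-specific style embedding can look like, the sketch below keeps one vector per language and adds it to every frame of audio features. This is a minimal sketch under stated assumptions, not the authors' implementation: the language codes, the feature size, the additive combination, and the helper names `make_style_table` and `condition_on_language` are all hypothetical.

```python
import random

# Hypothetical language codes; the dataset covers 20 languages, three shown here.
LANGUAGES = ["en", "ko", "fr"]
FEATURE_DIM = 4  # toy feature size, chosen only for illustration


def make_style_table(languages, dim, seed=0):
    """Create one style vector per language.

    The random values stand in for embeddings that would be learned
    during training (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return {lang: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for lang in languages}


def condition_on_language(audio_features, language, style_table):
    """Add the chosen language's style vector to each per-frame feature."""
    style = style_table[language]
    return [[a + s for a, s in zip(frame, style)] for frame in audio_features]


style_table = make_style_table(LANGUAGES, FEATURE_DIM)

# Two frames of dummy audio features (zeros, so the style offset is easy to see).
frames = [[0.0] * FEATURE_DIM for _ in range(2)]
conditioned = condition_on_language(frames, "ko", style_table)
```

With the same audio features, switching the language code changes only the added offset, which is the sense in which the embedding carries language-specific style in this toy setup.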
Findings
Significant improvement in multilingual lip-sync accuracy.
Introduction of a new multilingual 2D video dataset with over 420 hours of talking videos in 20 languages.
Effective incorporation of language-specific style embeddings.
Abstract
Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulations. However, generating accurate lip-syncs degrades when applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task to generate 3D talking heads from speeches of diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. With our proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed…
Taxonomy
Topics: Face recognition and analysis · Human Pose and Action Recognition · Multimodal Machine Learning Applications
