Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement
Siddarth Ravichandran, Ondřej Texler, Dimitar Dinev, Hyun Jae Kang

TL;DR
This paper presents a real-time, end-to-end framework for synthesizing photorealistic virtual human faces with accurate lip synchronization, leveraging cross-modal disentanglement and hierarchical data augmentation to improve quality and performance.
Contribution
The authors introduce a novel network architecture using visemes for lip sync and a hierarchical augmentation strategy for disentangling control modalities, enabling real-time high-quality virtual human synthesis.
Findings
Runs in real time with high visual quality
Outperforms current state-of-the-art methods
Achieves accurate lip synchronization
Abstract
Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans. This has led to the rapid growth of so-called deepfake and talking-head generation methods in recent years. Despite their impressive results and popularity, they usually lack certain qualitative aspects such as texture quality, lip synchronization, or resolution, and practical aspects such as the ability to run in real time. To allow for virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion with a special emphasis on performance. We…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
