Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement

Siddarth Ravichandran; Ondřej Texler; Dimitar Dinev; Hyun Jae Kang

arXiv:2209.01320 · cs.CV · March 27, 2023

TL;DR

This paper presents a real-time, end-to-end framework for synthesizing photorealistic virtual human faces with accurate lip synchronization, leveraging cross-modal disentanglement and hierarchical data augmentation to improve quality and performance.

Contribution

The authors introduce a novel network architecture using visemes for lip sync and a hierarchical augmentation strategy for disentangling control modalities, enabling real-time high-quality virtual human synthesis.
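To make the viseme idea concrete, here is a minimal sketch of how a phoneme sequence might be reduced to viseme classes before it drives lip motion. The table `PHONEME_TO_VISEME`, the class names, and the function `phonemes_to_visemes` are hypothetical illustrations, not the authors' viseme set or pipeline, which this summary does not describe.

```python
from itertools import groupby

# Hypothetical, heavily simplified phoneme-to-viseme table using
# ARPAbet-style symbols. This is an assumption for illustration only;
# the paper's actual viseme inventory is not given in this summary.
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open", "AE": "open",
    "UW": "rounded", "OW": "rounded",
}


def phonemes_to_visemes(phonemes):
    """Map each phoneme to a viseme class and merge consecutive repeats.

    Phonemes missing from the table fall back to a "neutral" class.
    """
    visemes = [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]
    # groupby collapses runs of identical neighbours into a single entry.
    return [v for v, _ in groupby(visemes)]


if __name__ == "__main__":
    print(phonemes_to_visemes(["M", "AA", "P", "B", "UW"]))
```

Several phonemes that look alike on the lips share one class here, which is the usual motivation for driving lip sync from visemes rather than from raw phonemes.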

Findings

1. Runs in real-time with high visual quality

2. Outperforms current state-of-the-art methods

3. Achieves accurate lip synchronization

Abstract

Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans. This has led to the rapid growth of so-called deepfake and talking-head generation methods in recent years. Despite their impressive results and popularity, they usually lack certain qualitative aspects such as texture quality, lip synchronization, or resolution, and practical aspects such as the ability to run in real time. To allow for virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion with a special emphasis on performance. We…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis