GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation
Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao

TL;DR
GeneFace++ introduces a novel NeRF-based method for real-time, stable, and generalized audio-driven 3D talking face generation, addressing key challenges in lip synchronization, video quality, and system efficiency.
Contribution
It proposes new techniques including auxiliary pitch features, a landmark locally linear embedding, and an efficient NeRF renderer to improve stability, quality, and speed.
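The summary names a landmark locally linear embedding but does not say how it is applied. The following is a minimal sketch, assuming it works as a projection step: a predicted landmark vector is rewritten as a weighted combination of its nearest neighbours in a bank of landmarks seen during training, which pulls out-of-domain predictions back toward familiar inputs for the renderer. The function name `lle_project`, the parameters `k` and `reg`, and the toy 2-D data are all hypothetical and chosen for illustration; this is the standard LLE reconstruction-weight computation, not necessarily the authors' exact procedure.

```python
import numpy as np

def lle_project(query, bank, k=3, reg=1e-3):
    """Reconstruct `query` from its k nearest neighbours in `bank`.

    query: (d,) flattened landmark vector (hypothetical input).
    bank:  (n, d) landmark vectors from the training video (hypothetical).
    Returns the projected vector and the k reconstruction weights.
    """
    query = np.asarray(query, dtype=float)
    bank = np.asarray(bank, dtype=float)

    # Find the k nearest neighbours by Euclidean distance.
    dists = np.linalg.norm(bank - query, axis=1)
    idx = np.argsort(dists)[:k]
    neighbours = bank[idx]                      # shape (k, d)

    # Local Gram matrix of the neighbour offsets from the query.
    diff = neighbours - query                   # shape (k, d)
    gram = diff @ diff.T                        # shape (k, k)

    # Regularise so the linear system stays well conditioned.
    gram = gram + reg * (np.trace(gram) + 1e-12) * np.eye(k)

    # Solve for weights that minimise the reconstruction error
    # subject to the weights summing to one.
    weights = np.linalg.solve(gram, np.ones(k))
    weights = weights / weights.sum()

    return weights @ neighbours, weights

# Toy usage: the bank lies on the line y = 0, the query is far off it.
bank = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
projected, weights = lle_project([1.5, 5.0], bank, k=2)
```

In this toy case the projected point is a combination of bank points only, so it lands back on the line the bank occupies. A larger `k` gives smoother but less local reconstructions.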
Findings
Achieves stable, real-time talking face generation with high audio-lip synchronization.
Outperforms state-of-the-art methods in subjective and objective evaluations.
Demonstrates robustness to out-of-domain inputs and high-quality rendering.
Abstract
Generating talking person portraits from arbitrary speech audio is a crucial problem in the fields of digital humans and the metaverse. A modern talking face generation method is expected to achieve generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, the neural radiance field (NeRF) has become a popular rendering technique in this field, since it can achieve high-fidelity and 3D-consistent talking face generation from a training video only a few minutes long. However, several challenges remain for NeRF-based methods: 1) for lip synchronization, it is hard to generate a long facial motion sequence with high temporal consistency and audio-lip accuracy; 2) for video quality, because limited data is used to train the renderer, it is vulnerable to out-of-domain input conditions and occasionally produces bad rendering results; 3)…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
