GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

Zhenhui Ye; Jinzheng He; Ziyue Jiang; Rongjie Huang; Jiawei Huang; Jinglin Liu; Yi Ren; Xiang Yin; Zejun Ma; Zhou Zhao

arXiv:2305.00787·cs.CV·May 2, 2023·5 cites

Open Access · 2 Models

TL;DR

GeneFace++ introduces a novel NeRF-based method for real-time, stable, and generalized audio-driven 3D talking face generation, addressing key challenges in lip synchronization, video quality, and system efficiency.

Contribution

It proposes new techniques including auxiliary pitch features, a landmark locally linear embedding, and an efficient NeRF renderer to improve stability, quality, and speed.
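
This summary names a landmark locally linear embedding but does not describe it. As a rough illustration of the general idea behind locally linear embedding, the sketch below assumes (this page does not confirm it) that a predicted landmark vector is pulled toward nearby reference examples by rewriting it as a weighted combination of its nearest neighbors. The names `lle_project`, `bank`, `k`, and `reg` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def lle_project(query, bank, k=3, reg=1e-3):
    """Rebuild `query` as an affine combination of its k nearest rows in `bank`.

    Illustrative sketch only; names and defaults are assumptions.
    Returns the reconstruction, the neighbor indices, and the weights.
    """
    # Euclidean distance from the query to every reference vector.
    dists = np.linalg.norm(bank - query, axis=1)
    idx = np.argsort(dists)[:k]
    neighbors = bank[idx]                      # shape (k, d)

    # Local Gram matrix of neighbor offsets relative to the query.
    diff = neighbors - query                   # shape (k, d)
    gram = diff @ diff.T                       # shape (k, k)

    # Regularize so the linear system stays well conditioned.
    gram += (reg * np.trace(gram) + 1e-9) * np.eye(k)

    # Standard LLE weights: solve G w = 1, then normalize so sum(w) == 1.
    w = np.linalg.solve(gram, np.ones(k))
    w /= w.sum()

    return w @ neighbors, idx, w

# Toy data: reference points lie on the line y = 0, the query sits off it.
bank = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
projected, idx, w = lle_project(np.array([1.5, 1.0]), bank, k=2)
```

In this toy case the reconstruction lands at approximately (1.5, 0): the component of the query that no reference example supports is discarded, which is the stabilizing effect such a projection is meant to provide.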

Findings

01

Achieves stable, real-time talking face generation with high audio-lip synchronization.

02

Outperforms state-of-the-art methods in subjective and objective evaluations.

03

Demonstrates robustness to out-of-domain inputs and high-quality rendering.

Abstract

Generating talking person portraits from arbitrary speech audio is a crucial problem in the fields of digital humans and the metaverse. A modern talking face generation method is expected to achieve generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, the neural radiance field (NeRF) has become a popular rendering technique in this field, since it can achieve high-fidelity and 3D-consistent talking face generation from a training video only a few minutes long. However, several challenges remain for NeRF-based methods: 1) for lip synchronization, it is hard to generate a long facial motion sequence with high temporal consistency and audio-lip accuracy; 2) for video quality, because only limited data is used to train the renderer, it is vulnerable to out-of-domain input conditions and occasionally produces bad rendering results; 3)…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings