RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Xiaozhong Ji; Chuming Lin; Zhonggan Ding; Ying Tai; Junwei Zhu; Xiaobin Hu; Donghao Luo; Yanhao Ge; Chengjie Wang

arXiv:2406.18284 · cs.CV · August 9, 2024

Open Access

TL;DR

RealTalk is a novel framework for real-time, high-quality audio-driven face generation that uses a 3D facial prior-guided identity alignment network to preserve individual traits while achieving accurate lip synchronization.

Contribution

The paper introduces a generalized framework combining an audio-to-expression transformer with a lightweight face renderer and a facial identity alignment module for improved accuracy and efficiency.
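To make the data flow concrete, here is a minimal sketch of how the three modules named above could be wired together. Everything in it is an assumption for illustration: the class `TalkingFacePipeline`, its field names, the list-based stand-ins for audio features, expression coefficients, and images, and the choice to align the reference face once per frame, conditioned on that frame's predicted expression. It is not the authors' implementation. Each stage is an injected callable, so the toy lambdas in the usage lines can be replaced by real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical representations; the paper's actual tensors are not specified here.
AudioFeature = List[float]   # one window of audio features per output frame
Expression = List[float]     # per-frame expression coefficients (e.g. from a 3D face model)
Image = List[List[float]]    # toy stand-in for an image or feature map


@dataclass
class TalkingFacePipeline:
    """Hypothetical wiring of the three modules named in the contribution."""

    # Audio-to-expression transformer: whole audio sequence -> one expression per frame.
    audio_to_expression: Callable[[Sequence[AudioFeature]], List[Expression]]
    # Facial identity alignment (assumed signature): reference + expression -> aligned reference.
    align_identity: Callable[[Image, Expression], Image]
    # Lightweight face renderer: aligned reference + expression -> output frame.
    render: Callable[[Image, Expression], Image]

    def run(self, audio: Sequence[AudioFeature], reference: Image) -> List[Image]:
        # Stage 1: predict one expression vector per output frame.
        expressions = self.audio_to_expression(audio)
        frames = []
        for expr in expressions:
            # Stage 2: align the reference identity to the target expression.
            aligned = self.align_identity(reference, expr)
            # Stage 3: render the final frame from the aligned reference.
            frames.append(self.render(aligned, expr))
        return frames


# Usage with toy stand-ins (not real models): each "expression" is the sum of its
# audio window, alignment is the identity map, and rendering adds it to every pixel.
pipe = TalkingFacePipeline(
    audio_to_expression=lambda audio: [[sum(window)] for window in audio],
    align_identity=lambda ref, expr: ref,
    render=lambda img, expr: [[px + expr[0] for px in row] for row in img],
)
frames = pipe.run([[1, 2], [3, 4]], [[0, 1]])
print(frames)  # [[[3, 4]], [[7, 8]]]
```

Keeping the stages as separate callables mirrors the split between a sequence model that runs over the whole audio clip and a per-frame renderer, which is where a lightweight renderer would matter for real-time throughput.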

Findings

01

Outperforms previous methods in lip-speech synchronization.

02

Generates high-quality facial renderings in real-time.

03

Requires fewer computational resources.

Abstract

Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but a significant gap remains between current results and practical applications. The challenges are two-fold: 1) preserving unique individual traits to achieve high-precision lip synchronization, and 2) generating high-quality facial renderings with real-time performance. In this paper, we propose RealTalk, a novel generalized audio-driven framework that consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater…
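The abstract credits cross-modal attention over enriched facial priors with aligning lip movements to audio. As a minimal sketch, assuming the standard scaled dot-product attention of "Attention Is All You Need", with audio features as queries and facial-prior tokens as keys and values, one attention step looks like the code below. The function name `cross_modal_attention` and the token layout are hypothetical, and learned query/key/value projections and multiple heads are omitted; this illustrates the general mechanism, not RealTalk's exact module.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(audio_q: Matrix, prior_k: Matrix, prior_v: Matrix) -> Matrix:
    """Each audio query attends over facial-prior tokens (keys/values).

    Computes softmax(Q K^T / sqrt(d)) V row by row; projections are omitted.
    """
    d = len(audio_q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in audio_q:
        # Similarity between this audio frame and every facial-prior token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in prior_k]
        weights = softmax(scores)
        # Weighted sum of prior values: a prior-informed feature for this frame.
        out.append([sum(w * v[j] for w, v in zip(weights, prior_v))
                    for j in range(len(prior_v[0]))])
    return out


# With identical keys the weights are uniform, so the output is the mean of the values.
print(cross_modal_attention([[1.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]]))
# [[3.0, 1.0]]
```

Because the queries come from audio while keys and values come from facial priors, each audio frame retrieves the identity-specific facial information most relevant to it, which is the intuition behind using cross-modal rather than self-attention here.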

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need · ALIGN