RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang

TL;DR
RealTalk is a novel framework for real-time, high-quality audio-driven face generation that effectively preserves individual traits and lip synchronization using a 3D facial prior-guided identity alignment network.
Contribution
The paper introduces a generalized framework combining an audio-to-expression transformer with a lightweight face renderer and a facial identity alignment module for improved accuracy and efficiency.
Findings
Outperforms previous methods in lip-speech synchronization.
Generates high-quality facial renderings in real-time.
Requires fewer computational resources.
Abstract
Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-quality facial renderings in real-time performance. In this paper, we propose a novel generalized audio-driven framework RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater…
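The cross-modal attention mentioned in the abstract can be illustrated with a minimal scaled dot-product sketch. This is not the authors' implementation: treating audio frames as queries and facial prior features as keys and values is an assumption made here for illustration, and the names `cross_attention`, `audio_frames`, `prior_keys`, and `prior_values`, along with the toy dimensions, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another.

    queries: list of query vectors (assumed here: per-frame audio features)
    keys, values: lists of vectors (assumed here: facial prior features)
    Returns the fused outputs and the attention weights per query.
    """
    d = len(keys[0])  # key dimension, used for the 1/sqrt(d) scaling
    outputs, weights = [], []
    for q in queries:
        # Similarity between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of value vectors, one output channel at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs (hypothetical): 3 audio frames, 2 facial prior tokens.
audio_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prior_keys = [[1.0, 0.0], [0.0, 1.0]]
prior_values = [[0.2, 0.8, 0.0], [0.9, 0.1, 0.5]]

fused, attn = cross_attention(audio_frames, prior_keys, prior_values)
```

Each row of `attn` sums to 1, so every audio frame receives a convex combination of the prior features. In a trained model the queries, keys, and values would come from learned projections rather than raw vectors.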
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need · ALIGN
