RealTalk: Realistic Emotion-Aware Lifelike Talking-Head Synthesis

Wenqing Wang, Yun Fu

arXiv:2508.12163 · cs.CV · August 19, 2025

TL;DR

RealTalk is a framework for synthesizing realistic emotional talking-head videos with high emotion accuracy, fine-grained emotion controllability, and strong identity preservation, a step toward socially intelligent AI systems.

Contribution

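RealTalk combines three components for emotion-aware talking-head synthesis: a variational autoencoder (VAE) that predicts 3D facial landmarks from driving audio, a ResNet-based landmark deformation model (LDM) that turns those landmarks into emotional landmarks using emotion-label embeddings, and a tri-plane attention Neural Radiance Field (NeRF) that renders the final frames.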
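
The tri-plane lookup at the heart of the renderer can be illustrated with a small sketch. The pure-Python code below, written under stated assumptions, computes a feature for a 3D point by projecting it onto three axis-aligned feature planes (XY, XZ, YZ), bilinearly sampling each plane, and fusing the three samples with scaled dot-product attention against a query vector. The plane layout, the `query` vector, and all function names are hypothetical; RealTalk's actual attention design and its conditioning on landmarks and blendshape coefficients are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Plane = List[List[Vector]]  # H x W grid of C-dimensional feature vectors (hypothetical layout)


def bilinear(plane: Plane, u: float, v: float) -> Vector:
    """Bilinearly sample a feature plane at normalized coordinates u, v in [-1, 1]."""
    h, w = len(plane), len(plane[0])
    u, v = max(-1.0, min(1.0, u)), max(-1.0, min(1.0, v))
    # Map [-1, 1] to continuous column (x) and row (y) indices.
    x = (u + 1.0) * 0.5 * (w - 1)
    y = (v + 1.0) * 0.5 * (h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    out = []
    for k in range(len(plane[0][0])):
        top = plane[y0][x0][k] * (1 - dx) + plane[y0][x1][k] * dx
        bottom = plane[y1][x0][k] * (1 - dx) + plane[y1][x1][k] * dx
        out.append(top * (1 - dy) + bottom * dy)
    return out


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def triplane_feature(planes: Tuple[Plane, Plane, Plane],
                     point: Tuple[float, float, float],
                     query: Vector) -> Vector:
    """Fuse XY, XZ and YZ plane samples with scaled dot-product attention (hypothetical fusion)."""
    x, y, z = point
    xy, xz, yz = planes
    feats = [bilinear(xy, x, y), bilinear(xz, x, z), bilinear(yz, y, z)]
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in feats]
    weights = softmax(scores)
    # Attention-weighted sum of the three plane features, channel by channel.
    return [sum(wt * feat[k] for wt, feat in zip(weights, feats)) for k in range(len(query))]


def constant_plane(h: int, w: int, vec: Vector) -> Plane:
    """Toy plane filled with the same feature vector everywhere."""
    return [[list(vec) for _ in range(w)] for _ in range(h)]


# Toy usage: 4 x 4 planes with 2 feature channels.
planes = (constant_plane(4, 4, [1.0, 0.0]),
          constant_plane(4, 4, [0.0, 1.0]),
          constant_plane(4, 4, [0.0, 0.0]))
fused = triplane_feature(planes, (0.2, -0.5, 0.9), query=[0.0, 0.0])
```

With a zero query the attention weights are uniform, so the fused feature is simply the mean of the three plane samples; a learned, condition-dependent query would instead let the model emphasize whichever plane carries the most useful information for a given point.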

Findings

01. Outperforms existing methods in emotion accuracy
02. Enhances emotion controllability in generated videos
03. Maintains high identity preservation

Abstract

Emotion is a critical component of artificial social intelligence. However, while current methods excel in lip synchronization and image quality, they often fail to generate accurate and controllable emotional expressions while preserving the subject's identity. To address this challenge, we introduce RealTalk, a novel framework for synthesizing emotional talking heads with high emotion accuracy, enhanced emotion controllability, and robust identity preservation. RealTalk employs a variational autoencoder (VAE) to generate 3D facial landmarks from driving audio, which are concatenated with emotion-label embeddings using a ResNet-based landmark deformation model (LDM) to produce emotional landmarks. These landmarks and facial blendshape coefficients jointly condition a novel tri-plane attention Neural Radiance Field (NeRF) to synthesize highly realistic emotional talking heads. Extensive…
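
The landmark stage described in the abstract can be sketched as follows. This is a toy, pure-Python illustration under assumptions, not the authors' code: a conditional-VAE decoder (hypothetical `landmarks_from_audio`) maps a latent sample drawn from a standard-normal prior, together with a per-frame audio feature, to flattened 3D landmarks; a one-hidden-layer residual block (hypothetical `deform_landmarks`) then concatenates those landmarks with an emotion-label embedding and adds a predicted displacement back onto them. Layer sizes, the number of emotion classes, and the random weights are placeholders.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (row-major weights, bias)


def dense(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Affine map y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def random_layer(rng: random.Random, n_out: int, n_in: int, scale: float = 0.1) -> Layer:
    """Gaussian-initialized placeholder weights (stand-in for trained parameters)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def reparameterize(mu: Vector, logvar: Vector, rng: random.Random) -> Vector:
    """VAE reparameterization trick: z = mu + exp(0.5 * logvar) * eps, eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def landmarks_from_audio(decoder: Layer, audio_feat: Vector, latent_dim: int,
                         rng: random.Random) -> Vector:
    """Hypothetical conditional-VAE decoder: (z, audio feature) -> flattened 3D landmarks."""
    # At inference, sample z from the standard-normal prior (mu = 0, logvar = 0).
    z = reparameterize([0.0] * latent_dim, [0.0] * latent_dim, rng)
    return dense(*decoder, z + audio_feat)  # decoder input is the concatenation [z; audio]


def deform_landmarks(landmarks: Vector, emotion_id: int, emotion_table: List[Vector],
                     hidden: Layer, out: Layer) -> Vector:
    """Hypothetical residual deformation: neutral landmarks + emotion label -> emotional landmarks."""
    x = landmarks + emotion_table[emotion_id]          # concatenate landmarks and label embedding
    h = [math.tanh(v) for v in dense(*hidden, x)]      # one nonlinear hidden layer
    offset = dense(*out, h)                            # predicted per-coordinate displacement
    return [p + d for p, d in zip(landmarks, offset)]  # ResNet-style skip connection


# Toy usage with placeholder sizes: 5 landmarks (15 coordinates), 8 emotion classes.
rng = random.Random(0)
K, AUDIO_DIM, LATENT_DIM, EMB_DIM, HIDDEN_DIM, NUM_EMOTIONS = 5, 8, 4, 6, 16, 8
decoder = random_layer(rng, 3 * K, LATENT_DIM + AUDIO_DIM)
emotion_table = [[rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)] for _ in range(NUM_EMOTIONS)]
hidden = random_layer(rng, HIDDEN_DIM, 3 * K + EMB_DIM)
out = random_layer(rng, 3 * K, HIDDEN_DIM)

audio_feat = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
neutral = landmarks_from_audio(decoder, audio_feat, LATENT_DIM, rng)
emotional = deform_landmarks(neutral, 3, emotion_table, hidden, out)
```

Because of the residual form, an all-zero output layer leaves the neutral landmarks unchanged, so the deformation network only has to learn the emotion-specific offset rather than regenerate the whole face shape.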
