ESGaussianFace: Emotional and Stylized Audio-Driven Facial Animation via 3D Gaussian Splatting
Chuhang Ma, Shuai Tan, Ye Pan, Jiaolong Yang, Xin Tong

TL;DR
ESGaussianFace introduces a 3D Gaussian Splatting framework for efficient, high-quality, 3D-consistent audio-driven facial animation that is both emotionally expressive and stylized; the authors report that it outperforms existing methods in lip accuracy and expression variation.
Contribution
The paper presents a framework that combines 3D Gaussian Splatting, emotion-guided spatial attention, and multi-stage training to produce realistic emotional and stylized facial animation from audio.
Findings
Outperforms state-of-the-art in lip accuracy and expression variation
Achieves high efficiency and 3D consistency in facial animation
Effectively integrates emotion and style features for realistic results
Abstract
Most current audio-driven facial animation research focuses on generating videos with neutral emotions. While some studies have addressed the generation of facial videos driven by emotional audio, efficiently generating high-quality talking head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, an innovative framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring efficient generation of 3D-consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features. Through emotion-guided attention, the model is able to reconstruct facial details across different emotional states more accurately. To…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
