SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting

Ziqiao Peng; Wentao Hu; Junyuan Ma; Xiangyu Zhu; Xiaomei Zhang; Hao Zhao; Hui Tian; Jun He; Hongyan Liu; Zhaoxin Fan

arXiv:2506.14742 · cs.CV · June 18, 2025

TL;DR

SyncTalk++ is a Gaussian Splatting-based framework for speech-driven talking head video. It improves synchronization, realism, and rendering speed by combining a Dynamic Portrait Renderer, a Face-Sync Controller built on a 3D facial blendshape model, and a Head-Sync Stabilizer for head poses.

Contribution

The paper introduces a system that combines Gaussian Splatting rendering, 3D facial blendshape modeling, and head-pose stabilization to improve synchronization and realism in talking head synthesis.

Findings

01 Achieves up to 101 fps rendering speed.

02 Outperforms state-of-the-art methods in synchronization accuracy.

03 Demonstrates improved realism in user studies.

Abstract

Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability.…

Taxonomy

Topics: Speech and Audio Processing · Cellular Automata and Applications · Interactive and Immersive Displays

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings