PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment
Bin Wang, Yang Xu, Huan Zhao, Hao Zhang, Zixing Zhang

TL;DR
PTalker is a novel framework for personalized 3D talking head animation that disentangles speaking style from speech and aligns audio and facial motion modalities to produce realistic, style-preserving animations.
Contribution
PTalker introduces a style disentanglement approach and a three-level modality alignment mechanism to improve personalization and lip-sync accuracy in 3D talking head generation.
Findings
Outperforms state-of-the-art methods in realism and style preservation
Achieves high lip-synchronization accuracy through multi-level alignment
Generates personalized 3D talking heads matching individual speaking styles
Abstract
Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely "PTalker". This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
