LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization
Avisek Lahiri, Vivek Kwatra, Christian Frueh, John Lewis, Chris Bregler

TL;DR
This paper introduces a data-efficient framework for animating personalized 3D talking faces from audio, using pose and lighting normalization to achieve high-quality lip-sync videos with minimal training data.
Contribution
It proposes novel pose and lighting normalization techniques that improve data efficiency and realism in 3D talking face synthesis from limited video data.
Findings
Outperforms contemporary state-of-the-art methods in realism and lip-sync quality.
Achieves high-fidelity lip-sync with just a single speaker video.
Effectively handles novel lighting conditions during animation.
Abstract
In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous…
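To make the idea of a pose-normalized space more concrete, the sketch below removes rigid head pose from a set of 3D face points by aligning them to a reference shape. It uses the standard Kabsch (orthogonal Procrustes) solution and is only an illustration under assumptions: the abstract does not say how the authors compute their normalization, and the function name `rigid_align`, the `(N, 3)` point layout, and the choice of reference shape are all hypothetical.

```python
import numpy as np


def rigid_align(points, reference):
    """Rigidly align `points` to `reference` (both arrays of shape (N, 3)).

    Illustrative Kabsch alignment, not the paper's stated procedure.
    Returns the aligned points and the 3x3 rotation that was applied.
    """
    # Remove translation by centering both point sets.
    p_mean = points.mean(axis=0)
    r_mean = reference.mean(axis=0)
    p_centered = points - p_mean
    r_centered = reference - r_mean

    # The cross-covariance matrix and its SVD give the optimal rotation.
    h = p_centered.T @ r_centered
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate the centered points and move them onto the reference centroid.
    aligned = (rot @ p_centered.T).T + r_mean
    return aligned, rot
```

After such an alignment, the remaining differences between frames come from non-rigid shape changes (for example, lip motion) rather than head rotation or translation, which is the kind of decoupling the abstract describes.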
