JOLT3D: Joint Learning of Talking Heads and 3DMM Parameters with Application to Lip-Sync

Sungjoon Park; Minsik Park; Haneol Lee; Jaesub Yun; Donggeon Lee

arXiv:2507.20452 · cs.CV · July 29, 2025

TL;DR

JOLT3D introduces a joint learning framework for 3D face reconstruction and talking head synthesis, enhancing lip-sync quality and facial expression control by leveraging a FACS-based blendshape representation.

Contribution

The paper presents a novel joint learning approach that improves 3DMM-based talking head synthesis and enables modifications targeted at the mouth region for better lip-sync.

Findings

01 Improved facial synthesis quality.

02 Enhanced lip-sync accuracy.

03 Reduced flickering near the mouth.

Abstract

In this work, we revisit the effectiveness of 3DMM for talking head synthesis by jointly learning a 3D face reconstruction model and a talking head synthesis model. This enables us to obtain a FACS-based blendshape representation of facial expressions that is optimized for talking head synthesis. This contrasts with previous methods that either fit 3DMM parameters to 2D landmarks or rely on pretrained face reconstruction models. Not only does our approach increase the quality of the generated face, but it also allows us to take advantage of the blendshape representation to modify just the mouth region for the purpose of audio-based lip-sync. To this end, we propose a novel lip-sync pipeline that, unlike previous methods, decouples the original chin contour from the lip-synced chin contour, and reduces flickering near the mouth.
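
The idea of using a blendshape representation to modify just the mouth region can be illustrated with a small sketch. This is not the authors' pipeline: the blendshape names, the prefix rule for selecting mouth-related coefficients, and the helper functions `mouth_mask`, `replace_mouth`, and `apply_blendshapes` are all assumptions made for illustration. It only shows how coefficients tied to the mouth could be swapped for audio-driven ones while the rest of the expression is left unchanged.

```python
import numpy as np

# Hypothetical blendshape names, chosen for illustration only;
# the paper's actual FACS-based basis may differ.
BLENDSHAPE_NAMES = [
    "browInnerUp", "eyeBlinkLeft", "eyeBlinkRight",
    "jawOpen", "mouthFunnel", "mouthSmileLeft", "mouthSmileRight",
]
# Assumption: mouth-related blendshapes can be picked out by name prefix.
MOUTH_PREFIXES = ("jaw", "mouth")


def mouth_mask(names):
    """Boolean mask that is True for mouth-related blendshapes."""
    return np.array([n.startswith(MOUTH_PREFIXES) for n in names])


def replace_mouth(original, audio_driven, mask):
    """Take mouth coefficients from audio_driven, everything else from original."""
    original = np.asarray(original, dtype=float)
    audio_driven = np.asarray(audio_driven, dtype=float)
    return np.where(mask, audio_driven, original)


def apply_blendshapes(neutral, deltas, coeffs):
    """Linear blendshape model: neutral mesh plus weighted vertex offsets.

    neutral: (V, 3) vertices, deltas: (K, V, 3) offsets, coeffs: (K,) weights.
    """
    return neutral + np.tensordot(coeffs, deltas, axes=1)


mask = mouth_mask(BLENDSHAPE_NAMES)
original = np.array([0.5, 0.1, 0.1, 0.2, 0.0, 0.3, 0.3])
audio_driven = np.array([0.0, 0.0, 0.0, 0.8, 0.4, 0.0, 0.0])
lip_synced = replace_mouth(original, audio_driven, mask)
```

In this sketch the brow and eye coefficients of `lip_synced` come from `original`, while the jaw and mouth coefficients come from `audio_driven`.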

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing