MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation

Yucheng Wang; Dan Xu

arXiv:2507.05092 · cs.CV · July 8, 2025

TL;DR

This paper introduces MoDiT, a diffusion transformer framework that enhances talking head generation by improving temporal consistency, identity preservation, and natural blinking through 3D modeling and hierarchical denoising.

Contribution

MoDiT combines the 3D Morphable Model (3DMM) with a diffusion transformer, introducing hierarchical denoising, explicit 3D constraints, and realistic blinking modeling for superior talking head synthesis.

Findings

1. Reduces temporal jittering in generated videos
2. Improves facial identity preservation during animation
3. Produces more natural and realistic blinking behaviors

Abstract

Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised…
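To make the notion of temporal jittering concrete, the sketch below scores a sequence of per-frame motion coefficients by the average magnitude of its second-order frame differences. This is an illustrative assumption, not a metric taken from the paper: the function name `temporal_jitter`, the two-dimensional toy coefficients, and the choice of a second-difference score are all hypothetical.

```python
from typing import Sequence


def temporal_jitter(coeffs: Sequence[Sequence[float]]) -> float:
    """Mean L2 norm of second-order differences across frames.

    `coeffs` holds one coefficient vector per frame (hypothetical layout).
    A smooth trajectory scores near zero; frame-to-frame flicker scores higher.
    """
    if len(coeffs) < 3:
        return 0.0  # a second difference needs at least three frames
    total = 0.0
    for prev, cur, nxt in zip(coeffs, coeffs[1:], coeffs[2:]):
        # Discrete acceleration per coefficient: x[t+1] - 2*x[t] + x[t-1]
        accel = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
        total += sum(a * a for a in accel) ** 0.5
    return total / (len(coeffs) - 2)


# Toy data: a linear ramp versus the same ramp with alternating offsets.
smooth = [[0.1 * t, 0.0] for t in range(6)]
jittery = [[0.1 * t + (0.05 if t % 2 else -0.05), 0.0] for t in range(6)]

print(temporal_jitter(smooth), temporal_jitter(jittery))
```

A second-difference score is used here because it ignores constant-velocity motion and responds only to abrupt changes between neighboring frames, which is the kind of inconsistency the abstract attributes to weak temporal constraints.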
