MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation
Yucheng Wang, Dan Xu

TL;DR
This paper introduces MoDiT, a diffusion transformer framework for audio-driven talking head generation that uses 3D modeling and hierarchical denoising to improve temporal consistency, identity preservation, and the naturalness of blinking.
Contribution
MoDiT combines the 3D Morphable Model (3DMM) with a diffusion transformer, introducing a hierarchical denoising strategy, explicit 3D constraints, and realistic blink modeling to address temporal jitter, identity drift, and unnatural blinking in talking head synthesis.
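For orientation, a 3DMM is commonly written as a mean face shape plus linear combinations of identity and expression bases. The form below is the generic textbook formulation, not notation taken from the paper; all symbols are assumptions introduced here for illustration, and the paper's exact parameterization of its motion coefficients may differ.

```latex
% Generic 3DMM shape model (assumed notation, not the paper's):
%   \bar{S}   mean face shape
%   B_{id}    identity basis,   \alpha  identity coefficients
%   B_{exp}   expression basis, \beta   expression coefficients
S = \bar{S} + B_{id}\,\alpha + B_{exp}\,\beta
```

Under this formulation, identity coefficients stay fixed for a given speaker while expression coefficients vary from frame to frame, which is why a coefficient-space model is a natural place to impose temporal and identity constraints.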
Findings
Reduces temporal jittering in generated videos
Improves facial identity preservation during animation
Produces more natural and realistic blinking behaviors
Abstract
Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised…
