Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang

TL;DR
Ditto is a diffusion-based framework for talking head synthesis that offers fine-grained control and real-time inference, overcoming speed and control limitations of previous models.
Contribution
Ditto introduces a diffusion transformer that operates in motion space, with its architecture and training strategy optimized for motion-identity disentanglement, fine-grained control, and real-time streaming in talking head synthesis.
Findings
Generates high-quality talking head videos with vivid expressions.
Achieves real-time inference with low delay and streaming capability.
Demonstrates superior controllability compared to prior methods.
Abstract
Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. Specifically, we utilize an off-the-shelf motion extractor and devise a diffusion transformer to generate representations in a specific motion space. We optimize the model architecture and training strategy to address the issues in generating motion representations, including insufficient disentanglement between motion and identity, and large internal discrepancies within the representation. Besides, we employ diverse conditional signals while establishing a mapping between motion representation and facial semantics, enabling control…
Taxonomy
Topics: Robotics and Automated Systems · Hand Gesture Recognition Systems · Social Robot Interaction and HRI
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
