Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

Tianqi Li; Ruobing Zheng; Minghui Yang; Jingdong Chen; Ming Yang

arXiv:2411.19509·cs.CV·March 9, 2026

TL;DR

Ditto is a diffusion-based framework for talking head synthesis that offers fine-grained control and real-time inference, overcoming speed and control limitations of previous models.

Contribution

The paper introduces a diffusion transformer that operates in motion space, with the architecture and training strategy optimized for motion-identity disentanglement, controllability, and real-time streaming in talking head synthesis.

Findings

01. Generates high-quality talking head videos with vivid expressions.
02. Achieves real-time inference with low delay and streaming capability.
03. Demonstrates superior controllability compared to prior methods.

Abstract

Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. Specifically, we utilize an off-the-shelf motion extractor and devise a diffusion transformer to generate representations in a specific motion space. We optimize the model architecture and training strategy to address the issues in generating motion representations, including insufficient disentanglement between motion and identity, and large internal discrepancies within the representation. Besides, we employ diverse conditional signals while establishing a mapping between motion representation and facial semantics, enabling control…
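
The abstract describes running diffusion over motion representations rather than pixels. As a rough illustration of that idea only, the sketch below runs a standard deterministic DDIM-style sampler over a sequence of per-frame motion vectors conditioned on audio features. Everything here is an assumption made for illustration: the sampler choice, the noise schedule, the dimensions, and the names `make_alpha_bars`, `make_toy_denoiser`, and `ddim_sample` are hypothetical, and the toy denoiser is an untrained stand-in, not Ditto's diffusion transformer or its motion space.

```python
import numpy as np


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def make_toy_denoiser(audio_dim, motion_dim, rng):
    """Hypothetical stand-in for a trained noise-prediction network.

    It projects audio features into the motion dimension with a fixed random
    matrix. A real system would use a learned model here.
    """
    proj = 0.01 * rng.standard_normal((audio_dim, motion_dim))

    def denoiser(x_t, t, audio_feats):
        # x_t: (frames, motion_dim), audio_feats: (frames, audio_dim)
        return 0.1 * x_t - audio_feats @ proj

    return denoiser


def ddim_sample(denoiser, audio_feats, motion_dim, alpha_bars, rng):
    """Deterministic DDIM-style reverse process over a motion sequence."""
    num_frames = audio_feats.shape[0]
    x = rng.standard_normal((num_frames, motion_dim))  # start from pure noise
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = denoiser(x, t, audio_feats)  # predicted noise
        # Estimate the clean motion sequence, then step to the previous noise level.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal((25, 16))  # 25 frames, 16-dim audio features (made up)
    denoiser = make_toy_denoiser(audio_dim=16, motion_dim=8, rng=rng)
    motion = ddim_sample(denoiser, audio, 8, make_alpha_bars(10), rng)
    print(motion.shape)
```

In a pipeline of the kind the abstract outlines, the sampled motion sequence would then be passed to a separate renderer together with the source identity. That stage is outside this sketch.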

Code & Models

Models

🤗 digital-avatar/ditto-talkinghead · model · ♡ 33

Taxonomy

Topics: Robotics and Automated Systems · Hand Gesture Recognition Systems · Social Robot Interaction and HRI

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion