DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers

Aaron Mir; Eduardo Alonso; Esther Mondragón

arXiv:2312.06400·cs.AI·December 12, 2023·1 cites

Open Access

TL;DR

DiT-Head introduces a diffusion transformer-based pipeline for high-resolution talking head synthesis driven by audio, achieving competitive visual quality and lip-sync accuracy across multiple identities.

Contribution

The paper presents a scalable diffusion transformer approach that uses audio conditioning to synthesize high-quality talking heads and generalizes to multiple identities.

Findings

01 Competitive visual quality compared to existing methods

02 Accurate lip-sync performance demonstrated

03 Effective across multiple identities

Abstract

We propose a novel talking head synthesis pipeline called "DiT-Head", which is based on diffusion transformers and uses audio as a condition to drive the denoising process of a diffusion model. Our method is scalable and can generalise to multiple identities while producing high-quality results. We train and evaluate our proposed approach and compare it against existing methods of talking head synthesis. We show that our model can compete with these methods in terms of visual quality and lip-sync accuracy. Our results highlight the potential of our proposed approach to be used for a wide range of applications, including virtual assistants, entertainment, and education. For a video demonstration of the results and our user study, please refer to our supplementary material.
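The abstract describes audio as the condition that drives the denoising process of a diffusion model. As a rough illustration of what that means, the sketch below shows the standard DDPM forward-noising step and a noise-prediction loss in which the denoiser also receives an audio feature vector. This is a minimal sketch under stated assumptions, not the authors' implementation: the linear noise schedule, the mean-squared-error objective, and the names `toy_denoiser` and `audio_feat` are all hypothetical and are not taken from the paper. In DiT-Head the denoiser would be a diffusion transformer; here it is a placeholder that ignores its inputs.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0); returns the noisy sample and the noise used."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def toy_denoiser(xt, t, audio_feat):
    """Hypothetical placeholder for the audio-conditioned network.

    A real model would use xt, t and audio_feat to predict the noise;
    this stand-in always predicts zero noise.
    """
    return [0.0 for _ in xt]


def noise_prediction_loss(x0, audio_feat, t, alpha_bars, rng, model=toy_denoiser):
    """Mean squared error between the predicted and the true noise."""
    xt, noise = add_noise(x0, t, alpha_bars, rng)
    pred = model(xt, t, audio_feat)
    return sum((p - e) ** 2 for p, e in zip(pred, noise)) / len(noise)


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    rng = random.Random(0)
    frame = [0.5, -0.5, 0.1]   # stand-in for a flattened image or latent
    audio = [0.0, 1.0]         # stand-in for an audio embedding
    print(noise_prediction_loss(frame, audio, t=10, alpha_bars=alpha_bars, rng=rng))
```

At inference time, a conditioned model of this kind would be run in reverse, starting from noise and denoising step by step while being fed the audio features for each frame.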

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing

Methods: Diffusion