DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Zipeng Qi; Xulong Zhang; Ning Cheng; Jing Xiao; Jianzong Wang

arXiv:2309.07509 · cs.CV · September 15, 2023

TL;DR

DiffTalker generates realistic talking faces driven jointly by audio and facial landmarks. It combines a transformer network with a diffusion network to produce accurate, articulate speaking faces without an extra feature-alignment step.

Contribution

The paper introduces a framework of two agent networks, linked by landmark guidance, that improves the realism and geometric accuracy of diffusion-based talking face generation.

Findings

01. Outperforms existing methods in face realism and accuracy

02. Produces articulate speaking faces without extra feature alignment

03. Efficiently integrates pre-trained diffusion models for face synthesis

Abstract

Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges of directly applying diffusion models, which are traditionally trained on text-image pairs, to audio control. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance…
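The two-network data flow described in the abstract can be sketched roughly as follows. This is only an illustration of the staging (audio to landmarks, landmarks to image), not the authors' implementation: the function names, the `Frame` type, and both toy stand-ins are hypothetical, and the real networks are a transformer and a diffusion model rather than the arithmetic used here. The abstract does not say whether the face generator also receives the audio directly, so the sketch assumes it is conditioned on landmarks only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # normalized (x, y) in [0, 1]


@dataclass
class Frame:
    landmarks: List[Point]    # completed landmarks for this frame
    image: List[List[float]]  # placeholder image (rows of pixel values)


def generate_talking_face(
    audio_frames: Sequence[Sequence[float]],
    known_landmarks: Sequence[Point],
    complete_landmarks: Callable[[Sequence[float], Sequence[Point]], List[Point]],
    render_face: Callable[[Sequence[Point]], List[List[float]]],
) -> List[Frame]:
    """Run the two stages per audio frame (hypothetical interface)."""
    frames = []
    for audio in audio_frames:
        # Stage 1: landmarks completion (a transformer in the paper).
        landmarks = complete_landmarks(audio, known_landmarks)
        # Stage 2: face generation (a diffusion model in the paper).
        # Assumption: conditioned on landmarks only.
        frames.append(Frame(landmarks, render_face(landmarks)))
    return frames


def toy_complete_landmarks(audio, known):
    """Toy stand-in: add one 'mouth' point that drops with audio energy."""
    energy = sum(abs(a) for a in audio) / max(len(audio), 1)
    mouth = (0.5, min(0.5 + 0.4 * energy, 0.9))
    return list(known) + [mouth]


def toy_render_face(landmarks, size=4):
    """Toy stand-in: mark each landmark on a size x size grid."""
    image = [[0.0] * size for _ in range(size)]
    for x, y in landmarks:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        image[row][col] = 1.0
    return image


audio = [[0.0, 0.0], [1.0, -1.0]]      # two toy audio frames
known = [(0.25, 0.25), (0.75, 0.25)]   # e.g. two fixed upper-face points
frames = generate_talking_face(audio, known, toy_complete_landmarks, toy_render_face)
```

Passing the two stages in as callables keeps the landmark representation as the only interface between them, which mirrors the abstract's point that landmarks connect the audio and image domains.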

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing

Methods: Diffusion