Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior

Jinting Wang; Li Liu; Jun Wang; Hei Victor Cheng

arXiv:2310.03363 · cs.CV · October 6, 2023 · 1 citation

TL;DR

This paper introduces SCLDM, a speech-to-face generation framework built on a speech-conditioned latent diffusion model with a face prior. The authors report that it generates more realistic faces and preserves speaker identity better than previous GAN-based methods.

Contribution

To the authors' knowledge, it is the first work to apply diffusion models to speech-to-face generation. It combines contrastive pre-training of the speech and face encoders with a face prior to improve realism and identity consistency.

Findings

01

Achieves significant improvements in face realism and identity preservation.

02

Demonstrates superior performance on the AVSpeech and VoxCeleb datasets.

03

Outperforms state-of-the-art methods in quantitative and qualitative evaluations.

Abstract

Speech-to-face generation is an intriguing area of research that focuses on generating realistic facial images based on a speaker's audio speech. However, state-of-the-art methods employing GAN-based architectures lack stability and cannot generate realistic face images. To fill this gap, we propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM. To the best of our knowledge, this is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation. Preserving the shared identity information between speech and face is crucial in generating realistic results. Therefore, we employ contrastive pre-training for both the speech encoder and the face encoder. This pre-training strategy facilitates effective alignment between the attributes of speech, such as age and gender,…
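The contrastive pre-training mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so what follows is an assumption: a symmetric InfoNCE-style objective of the kind commonly used to align two encoders, written in plain Python. The function names, the temperature value of 0.07, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(speech_emb, face_emb, temperature=0.07):
    """Assumed InfoNCE-style loss: pair i of speech and face is the positive,
    every other pairing in the batch is a negative."""
    s = [normalize(v) for v in speech_emb]
    f = [normalize(v) for v in face_emb]
    logits = [
        [sum(a * b for a, b in zip(sv, fv)) / temperature for fv in f]
        for sv in s
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the speech-to-face and face-to-speech directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy data standing in for encoder outputs (not real speech or face features).
random.seed(0)
face = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
speech = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in face]

aligned_loss = symmetric_contrastive_loss(speech, face)
shuffled_loss = symmetric_contrastive_loss(speech[::-1], face)
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is low when each speech embedding sits next to its own face embedding and high when the pairs are mismatched, which is the property such pre-training relies on to align identity-related attributes across the two modalities.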

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing

Methods: Latent Diffusion Model · Diffusion