Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior
Jinting Wang, Li Liu, Jun Wang, Hei Victor Cheng

TL;DR
This paper introduces a novel speech-to-face generation framework using a speech-conditioned latent diffusion model with face prior, significantly improving realism and identity preservation over previous GAN-based methods.
Contribution
To the authors' knowledge, this is the first work to apply diffusion models to speech-to-face generation. It combines contrastive pre-training of the speech and face encoders with a face prior to improve realism and identity consistency.
Findings
Reports significant improvements in face realism and identity preservation over GAN-based methods.
Evaluated on the AVSpeech and VoxCeleb datasets.
Outperforms state-of-the-art methods in both quantitative and qualitative evaluations.
Abstract
Speech-to-face generation is an intriguing area of research that focuses on generating realistic facial images based on a speaker's audio speech. However, state-of-the-art methods employing GAN-based architectures lack stability and cannot generate realistic face images. To fill this gap, we propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM. To the best of our knowledge, this is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation. Preserving the shared identity information between speech and face is crucial in generating realistic results. Therefore, we employ contrastive pre-training for both the speech encoder and the face encoder. This pre-training strategy facilitates effective alignment between the attributes of speech, such as age and gender,…
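The contrastive pre-training mentioned in the abstract can be illustrated with a small, dependency-free sketch. The abstract excerpt does not specify the loss, so this assumes a CLIP-style symmetric InfoNCE objective over paired speech and face embeddings. The function names, the `temperature` value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(speech_emb, face_emb, temperature=0.07):
    """Assumed CLIP-style loss: pair i of speech/face embeddings is the positive,
    every other pairing in the batch is a negative."""
    s = [normalize(v) for v in speech_emb]
    f = [normalize(v) for v in face_emb]
    # Cosine similarity between every speech and every face, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(si, fj)) / temperature for fj in f]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # face-to-speech direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: three speakers whose speech and face embeddings are perfectly aligned.
speech = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched_faces = [row[:] for row in speech]
# The same faces assigned to the wrong speakers.
shuffled_faces = [speech[1], speech[2], speech[0]]

matched_loss = symmetric_contrastive_loss(speech, matched_faces)
shuffled_loss = symmetric_contrastive_loss(speech, shuffled_faces)
```

Under these assumptions the loss is low when each speaker's speech embedding is closest to that speaker's own face embedding, and high when the pairings are mismatched. This is the alignment property that pre-training of this kind is meant to encourage.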
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
Methods: Latent Diffusion Model · Diffusion
