See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement
Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu

TL;DR
This paper introduces a novel speech-to-talking face method that directly extracts information from speech to generate high-resolution, high-quality talking face videos without source images, outperforming existing approaches.
Contribution
It presents a new approach combining speech-conditioned diffusion models, statistical priors, and region refinement to generate high-resolution talking faces directly from speech.
Findings
Outperforms existing methods on the HDTF, VoxCeleb, and AVSpeech datasets.
Described as the first method to generate high-resolution talking faces solely from speech.
Achieves high-quality, synchronized talking face videos.
Abstract
Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering…
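The abstract does not specify how the statistical facial prior and the sample-adaptive weighting module are combined. As a rough illustration of the general idea only, the sketch below blends a per-sample, speech-conditioned estimate with a dataset-level prior, using a weight computed from the speech embedding. All names (`adaptive_weight`, `blend_with_prior`, `gate_weights`) and the sigmoid gate are assumptions made for this example and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_weight(speech_embedding, gate_weights, gate_bias=0.0):
    """Hypothetical gate: turn a speech embedding into a scalar weight in (0, 1).

    A linear score followed by a sigmoid is an assumption for illustration;
    the paper's actual weighting module may look quite different.
    """
    score = sum(e * w for e, w in zip(speech_embedding, gate_weights)) + gate_bias
    return sigmoid(score)


def blend_with_prior(speech_estimate, prior_mean, weight):
    """Convex combination of a per-sample estimate and a dataset-level prior.

    weight near 1 trusts the speech-conditioned estimate;
    weight near 0 falls back to the statistical prior.
    """
    return [weight * s + (1.0 - weight) * p
            for s, p in zip(speech_estimate, prior_mean)]


# Toy usage with made-up numbers (not real facial features).
w = adaptive_weight([0.0, 0.0], [0.3, -0.7])
blended = blend_with_prior([1.0, 3.0], [0.0, 1.0], w)
```

A per-sample weight of this kind lets an uninformative speech clip lean on the prior while a distinctive one overrides it; whether the paper's module behaves this way is not stated in the summary above.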
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
