DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation
Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng

TL;DR
DREAM-Talk is a diffusion-based framework that generates realistic, emotionally expressive talking face videos from a single image, balancing lip-sync accuracy and expressive diversity.
Contribution
It introduces EmoDiff, a novel diffusion module for dynamic emotional expression generation, and a two-stage process for improved lip-sync and expressiveness in talking face synthesis.
Findings
Outperforms state-of-the-art methods in expressiveness and lip-sync accuracy
Generates diverse emotional expressions aligned with audio
Achieves high perceptual quality in synthesized videos
Abstract
The generation of emotional talking faces from a single portrait image remains a significant challenge. The simultaneous achievement of expressive emotional talking and accurate lip-sync is particularly difficult, as expressiveness is often compromised for the accuracy of lip-sync. As widely adopted by many prior works, the LSTM network often fails to capture the subtleties and variations of emotional expressions. To address these challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently. In the first stage, we propose EmoDiff, a novel diffusion module that generates diverse highly dynamic emotional expressions and head poses in accordance with the audio and the referenced emotion style. Given the strong correlation between lip motion and audio, we then refine the dynamics with…
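As a rough illustration of the first stage described in the abstract, the sketch below runs a generic DDPM-style reverse sampling loop that turns Gaussian noise into a per-frame motion vector, conditioned on audio features and an emotion style vector. It is not the paper's EmoDiff implementation: the names `make_schedule`, `toy_noise_predictor`, and `sample_motion`, the linear noise schedule, the feature shapes, and the hand-written predictor (which stands in for a trained network) are all assumptions made for illustration. Frames are also denoised independently here for brevity, whereas a real model would process the sequence jointly. The lip-refinement stage is not sketched, since the abstract excerpt is truncated before describing it.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (a common DDPM default, not taken from the paper)."""
    betas = [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta  # cumulative product of (1 - beta)
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_noise_predictor(x_t, t, audio_frame, emotion_style):
    """Hypothetical stand-in for a trained noise-prediction network.

    It treats the deviation of x_t from a scalar summary of the
    conditioning signals as "noise"; t is accepted but unused.
    """
    cond = sum(audio_frame) / len(audio_frame) + sum(emotion_style) / len(emotion_style)
    return [0.1 * (v - cond) for v in x_t]


def sample_motion(audio_feats, emotion_style, motion_dim=4, num_steps=50, seed=0):
    """Ancestral sampling: start from noise, denoise step by step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    # One motion vector per audio frame, initialised from N(0, 1).
    motion = [[rng.gauss(0.0, 1.0) for _ in range(motion_dim)] for _ in audio_feats]
    for t in reversed(range(num_steps)):
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        for i, audio_frame in enumerate(audio_feats):
            eps_hat = toy_noise_predictor(motion[i], t, audio_frame, emotion_style)
            motion[i] = [
                scale * (x - coef * e) + sigma * rng.gauss(0.0, 1.0)
                for x, e in zip(motion[i], eps_hat)
            ]
    return motion


# Made-up inputs: 5 audio frames with 3 features each, and a 2-d style vector.
audio = [[0.1 * f, 0.2, 0.3] for f in range(5)]
style = [1.0, 0.0]
motion = sample_motion(audio, style)
```

Because sampling starts from random noise, changing `seed` yields a different motion sequence for the same audio and style, which is one plausible reading of how a diffusion model can produce diverse expressions for a single input.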
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Diffusion
