DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

Chenxu Zhang; Chao Wang; Jianfeng Zhang; Hongyi Xu; Guoxian Song; You Xie; Linjie Luo; Yapeng Tian; Xiaohu Guo; Jiashi Feng

arXiv:2312.13578 · cs.CV · December 22, 2023 · 1 citation

TL;DR

DREAM-Talk is a diffusion-based framework that generates realistic, emotionally expressive talking face videos from a single image, balancing lip-sync accuracy and expressive diversity.

Contribution

The paper introduces EmoDiff, a diffusion module that generates dynamic emotional expressions, and a two-stage pipeline that improves both lip-sync and expressiveness in talking face synthesis.

Findings

01 Outperforms state-of-the-art methods in expressiveness and lip-sync accuracy

02 Generates diverse emotional expressions aligned with the audio

03 Achieves high perceptual quality in synthesized videos

Abstract

The generation of emotional talking faces from a single portrait image remains a significant challenge. Achieving expressive emotional talking and accurate lip-sync simultaneously is particularly difficult, as expressiveness is often sacrificed for lip-sync accuracy. The LSTM network, widely adopted in prior work, often fails to capture the subtleties and variations of emotional expressions. To address these challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework tailored for generating diverse expressions and accurate lip-sync concurrently. In the first stage, we propose EmoDiff, a novel diffusion module that generates diverse, highly dynamic emotional expressions and head poses in accordance with the audio and the referenced emotion style. Given the strong correlation between lip motion and audio, we then refine the dynamics with…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Diffusion