EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation
Guanwen Feng, Haoran Cheng, Yunan Li, Zhiyuan Ma, Chaoneng Li, Zhihao Qian, Qiguang Miao, Chi-Man Pun

TL;DR
EmoSpeaker is a novel method for fine-grained emotion-controlled talking face generation that improves emotional expression and lip synchronization using a visual attribute-guided audio decoupler and emotion intensity control.
Contribution
The paper introduces a visual attribute-guided audio decoupler, a fine-grained emotion coefficient prediction module, and an emotion intensity control method, advancing emotion control in talking face generation.
Findings
Outperforms existing methods in expression variation
Enhances lip synchronization accuracy
Enables finer emotion intensity classification
Abstract
Fine-grained emotion control is crucial for emotion generation tasks because it enhances the expressive capability of the generative model, allowing it to capture and express nuanced emotional states accurately and comprehensively, thereby improving the emotional quality and personalization of generated content. Generating fine-grained facial animations that accurately portray emotional expressions using only a portrait and an audio recording is challenging. To address this challenge, we propose a visual attribute-guided audio decoupler. It yields content vectors related solely to the audio content, enhancing the stability of subsequent lip movement coefficient predictions. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module. Additionally, we propose an emotion intensity…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
