Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation
Jiadong Liang, Feng Lu

TL;DR
This paper introduces a two-stage framework for generating emotionally expressive talking face videos by aligning facial cues like expression, gaze, and pose with speech, using 3D landmarks and self-supervised learning.
Contribution
It presents a novel two-stage approach that synthesizes emotionally aligned facial landmarks and generates high-quality talking face videos, improving realism and emotional coherence.
Findings
Outperforms state-of-the-art methods in visual quality
Achieves better emotional alignment in generated videos
Demonstrates effectiveness on the MEAD dataset
Abstract
Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial…
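The two-step decomposition described in the abstract (speech-to-landmarks synthesis, then landmarks-to-face generation) can be pictured as a simple function composition. The sketch below is only an illustration of that data flow, not the authors' implementation: the function names, the 68-landmark count, the feature shapes, and the random linear projection standing in for a learned model are all assumptions introduced here.

```python
import numpy as np

NUM_LANDMARKS = 68  # assumed landmark count; the abstract does not state one


def speech_to_landmarks(audio_features: np.ndarray, emotion: np.ndarray) -> np.ndarray:
    """Stage 1 placeholder (hypothetical): map per-frame audio features and an
    emotion vector to a 3D landmark sequence of shape (T, NUM_LANDMARKS, 3)."""
    num_frames = audio_features.shape[0]
    # Condition every frame on the same emotion vector.
    conditioned = np.concatenate(
        [audio_features, np.tile(emotion, (num_frames, 1))], axis=1
    )
    # Stand-in for a trained network: a fixed random linear projection.
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((conditioned.shape[1], NUM_LANDMARKS * 3))
    return (conditioned @ weights).reshape(num_frames, NUM_LANDMARKS, 3)


def landmarks_to_face(landmarks: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Stage 2 placeholder (hypothetical): produce one frame per landmark set.
    Here the reference image is simply repeated; a real generator would
    synthesize each frame from the landmarks."""
    num_frames = landmarks.shape[0]
    return np.repeat(reference[None, ...], num_frames, axis=0)


def generate_talking_face(audio_features, emotion, reference):
    """Chain the two stages, with landmarks as the intermediate variable."""
    landmarks = speech_to_landmarks(audio_features, emotion)
    frames = landmarks_to_face(landmarks, reference)
    return landmarks, frames
```

The point of the intermediate landmark array is that expression, gaze, and pose can all be expressed in one geometric representation before any pixels are generated, which is what lets the second stage be trained and swapped independently of the first.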
Taxonomy
Topics: Language, Metaphor, and Cognition · Social Robot Interaction and HRI · Speech and dialogue systems
