Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks
Najmeh Sadoughi, Carlos Busso

TL;DR
This paper introduces a speech-driven conditional GAN that generates realistic, expressive lip movements for virtual agents. It models emotion and lexical content without requiring transcripts, and it outperforms three state-of-the-art baselines in objective and subjective evaluations.
Contribution
The paper proposes a conditional sequential GAN (CSG) framework that models the interaction between emotion and lexical content directly from speech features, improving the realism of generated lip movements.
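To make the conditioning idea concrete, here is a minimal forward-pass sketch in NumPy. It is not the authors' architecture: the plain tanh RNN, all layer sizes, and the names `init_params`, `run_rnn`, `generator`, and `discriminator` are assumptions made for illustration, and no training loop is shown. It only illustrates how a sequential generator and discriminator can both be conditioned on per-frame speech features, with no transcript involved.

```python
import numpy as np

def init_params(rng, in_dim, hidden_dim, out_dim):
    """Small random weights for a toy recurrent network (hypothetical sizes)."""
    return {
        "W_in": rng.normal(0, 0.1, (in_dim, hidden_dim)),
        "W_h": rng.normal(0, 0.1, (hidden_dim, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        "W_out": rng.normal(0, 0.1, (hidden_dim, out_dim)),
        "b_out": np.zeros(out_dim),
    }

def run_rnn(params, inputs):
    """Run a plain tanh RNN over a (T, in_dim) sequence; return (T, out_dim)."""
    h = np.zeros(params["W_h"].shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(x_t @ params["W_in"] + h @ params["W_h"] + params["b_h"])
        outputs.append(h @ params["W_out"] + params["b_out"])
    return np.stack(outputs)

def generator(params, speech_feats, noise):
    """Map per-frame speech features plus noise to a lip-movement sequence."""
    # Conditioning: concatenate the speech features with the noise at every frame.
    return run_rnn(params, np.concatenate([speech_feats, noise], axis=1))

def discriminator(params, speech_feats, lip_seq):
    """Score a lip sequence given the same speech features (value in (0, 1))."""
    logits = run_rnn(params, np.concatenate([speech_feats, lip_seq], axis=1))
    # Average the per-frame logits, then squash to a probability.
    return 1.0 / (1.0 + np.exp(-logits.mean()))

# Hypothetical dimensions: 50 frames, 25 speech features, 12 lip coordinates.
rng = np.random.default_rng(0)
T, SPEECH_DIM, NOISE_DIM, LIP_DIM, HIDDEN = 50, 25, 8, 12, 32
g_params = init_params(rng, SPEECH_DIM + NOISE_DIM, HIDDEN, LIP_DIM)
d_params = init_params(rng, SPEECH_DIM + LIP_DIM, HIDDEN, 1)

speech = rng.normal(size=(T, SPEECH_DIM))   # stand-in for extracted speech features
noise = rng.normal(size=(T, NOISE_DIM))
fake_lips = generator(g_params, speech, noise)
p_real = discriminator(d_params, speech, fake_lips)
```

In a trained conditional GAN the discriminator would be optimized to tell recorded lip sequences from generated ones given the same speech input, and the generator would be optimized to fool it. The sketch omits that adversarial training step entirely.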
Findings
CSG outperforms three state-of-the-art baselines in both objective and subjective evaluations.
Emotion-dependent models enhance lip movement expressiveness.
Emotion-aware adaptation improves emotional consistency in generated movements.
Abstract
Articulation, emotion, and personality play strong roles in the orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important that we carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion and lexical content in a principled manner. This model uses a set of articulatory and emotional features directly extracted from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in terms of objective and subjective evaluations. When the target emotion is known, we propose to create…
