TL;DR
This paper introduces C-MET, a cross-modal emotion transfer method that enhances emotion expressiveness in talking face videos by modeling emotion semantic vectors across speech and visual modalities.
Contribution
It proposes a novel cross-modal approach using large-scale pretrained encoders to better transfer extended and nuanced emotions in talking face synthesis.
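As a rough illustration of what "emotion semantic vectors across speech and visual modalities" could mean, the sketch below treats an emotion vector as the difference between an emotional and a neutral embedding, and maps speech-side vectors into the visual space. Everything here is an assumption made for illustration: the function names are hypothetical, random arrays stand in for the outputs of the pretrained encoders, and the linear least-squares map is a simple stand-in, not the architecture described in the paper.

```python
import numpy as np


def emotion_semantic_vector(neutral_emb, emotional_emb):
    """Direction from a neutral embedding to an emotional one (assumed definition)."""
    return emotional_emb - neutral_emb


def fit_cross_modal_map(speech_vecs, visual_vecs):
    """Least-squares linear map W with speech_vecs @ W ~= visual_vecs.

    A linear map is an illustrative simplification, not the paper's model.
    """
    W, *_ = np.linalg.lstsq(speech_vecs, visual_vecs, rcond=None)
    return W


def transfer_emotion(neutral_face_emb, speech_vec, W):
    """Shift a neutral facial-expression embedding along the mapped emotion direction."""
    return neutral_face_emb + speech_vec @ W


# Toy data: random arrays stand in for pretrained encoder outputs.
rng = np.random.default_rng(0)
speech_dim, visual_dim, n_pairs = 8, 4, 64

neutral_speech = rng.normal(size=(n_pairs, speech_dim))
emotional_speech = rng.normal(size=(n_pairs, speech_dim))
speech_vecs = emotion_semantic_vector(neutral_speech, emotional_speech)

# Synthetic ground truth so that the fit can be checked.
true_W = rng.normal(size=(speech_dim, visual_dim))
visual_vecs = speech_vecs @ true_W

W = fit_cross_modal_map(speech_vecs, visual_vecs)

neutral_face = rng.normal(size=visual_dim)
edited_face = transfer_emotion(neutral_face, speech_vecs[0], W)
```

In this toy setup the edited facial embedding is the neutral one shifted by the visual-space counterpart of the speech emotion vector; in a real system, that embedding would then drive a face generator.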
Findings
Improves emotion accuracy by 14% over state-of-the-art methods.
Generates expressive talking face videos for unseen extended emotions.
Demonstrates effectiveness on MEAD and CREMA-D datasets.
Abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals (and even benefit from expressive text-to-speech (TTS) synthesis), but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for…
