MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding
Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang

TL;DR
MEDTalk is a novel framework for dynamic, fine-grained emotional 3D facial animation that disentangles content and emotion, integrating multimodal inputs for realistic and controllable talking head generation.
Contribution
MEDTalk introduces a disentangled embedding approach that gives independent control over lip movements and facial expressions, and it incorporates multimodal inputs for personalized, realistic emotional talking head synthesis.
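To make the idea of disentangling content from emotion concrete, the following is a minimal sketch of a generic swap-based cross-reconstruction loss. It is an illustration under stated assumptions, not the authors' implementation: the summary does not describe the actual encoders, decoder, or loss terms, so `encode_content`, `encode_emotion`, `decode`, and the availability of paired motions (same content with a different emotion) are all hypothetical stand-ins.

```python
# Toy stand-ins (hypothetical): a motion is a flat list of floats whose first
# half is treated as "content" (lip motion) and second half as "emotion".
# In a real system these would be learned networks, not slicing operations.

def encode_content(motion):
    return motion[: len(motion) // 2]

def encode_emotion(motion):
    return motion[len(motion) // 2 :]

def decode(content_code, emotion_code):
    # Recombine the two codes into a motion vector.
    return content_code + emotion_code

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_reconstruction_loss(m_c1e1, m_c2e2, m_c1e2, m_c2e1):
    """Swap emotion codes between two motions and compare each result with
    the ground-truth motion that has the swapped content/emotion pairing."""
    pred_c1e2 = decode(encode_content(m_c1e1), encode_emotion(m_c2e2))
    pred_c2e1 = decode(encode_content(m_c2e2), encode_emotion(m_c1e1))
    return mse(pred_c1e2, m_c1e2) + mse(pred_c2e1, m_c2e1)

# Made-up example data: two contents and two emotions.
c1, c2 = [1.0, 2.0], [3.0, 4.0]
e1, e2 = [0.1, 0.2], [0.5, 0.6]
loss = cross_reconstruction_loss(c1 + e1, c2 + e2, c1 + e2, c2 + e1)
```

With these perfectly separable toy codes the swapped reconstructions match their targets exactly. In a learned setting, driving such a loss down is what would push content and emotion information into separate embeddings.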
Findings
Achieves synchronized lip movements with vivid emotional expressions.
Enables control over facial expressions using text and reference images.
Supports integration into industrial production pipelines.
Abstract
Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal…
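The abstract mentions predicting frame-wise intensity variations and using them to adjust static emotion features. One simple way this could work is sketched below, assuming a multiplicative scaling of a single static emotion vector by a per-frame intensity in [0, 1]. The function name `modulate_emotion`, the clamping, and the scaling rule are assumptions for illustration; the summary does not specify the mechanism the paper actually uses.

```python
def modulate_emotion(static_emotion, intensities):
    """Return one emotion feature vector per frame by scaling a static
    emotion vector with that frame's intensity (clamped to [0, 1])."""
    frames = []
    for w in intensities:
        w = min(1.0, max(0.0, w))  # keep intensity in a valid range
        frames.append([w * v for v in static_emotion])
    return frames

# Made-up example: a 2-D static emotion feature and three frame intensities.
dynamic = modulate_emotion([2.0, 4.0], [0.0, 0.5, 1.5])
```

Here the third intensity exceeds 1 and is clamped, so the last frame reproduces the static feature unchanged, while the first frame is fully neutral.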
