Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation
Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, Yi Yang

TL;DR
This paper introduces a cost-effective method to enable emotion control in audio-driven talking-head generation by adapting existing models with lightweight, parameter-efficient modules, achieving state-of-the-art results on the LRW and MEAD benchmarks.
Contribution
The proposed EAT method transforms emotion-agnostic models into emotion-controllable ones using lightweight adaptations, reducing training costs and improving flexibility.
Findings
Achieves state-of-the-art performance on LRW and MEAD benchmarks.
Demonstrates strong generalization with scarce or no emotional training data.
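Introduces three lightweight modules for emotion control: the Deep Emotional Prompts, the Emotional Deformation Network, and the Emotional Adaptation Module.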
Abstract
Audio-driven talking-head synthesis is a popular research topic for virtual human-related applications. However, the inflexibility and inefficiency of existing methods, which necessitate expensive end-to-end training to transfer emotions from guidance videos to talking-head predictions, are significant limitations. In this work, we propose the Emotional Adaptation for Audio-driven Talking-head (EAT) method, which transforms emotion-agnostic talking-head models into emotion-controllable ones in a cost-effective and efficient manner through parameter-efficient adaptations. Our approach utilizes a pretrained emotion-agnostic talking-head transformer and introduces three lightweight adaptations (the Deep Emotional Prompts, Emotional Deformation Network, and Emotional Adaptation Module) from different perspectives to enable precise and realistic emotion controls. Our experiments demonstrate…
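
To make the idea of parameter-efficient adaptation more concrete, here is a minimal sketch of one generic way prompt-style adaptation can work: a frozen backbone receives a few extra learnable tokens per layer, selected by an emotion label. This is not the authors' implementation. The layer function, sizes, emotion labels, and names such as `frozen_layer`, `prompt_tables`, and `forward` are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes and labels below are hypothetical toy values, not taken from the paper.
D_MODEL = 16     # token width
N_LAYERS = 3     # number of frozen backbone layers
N_PROMPTS = 1    # emotion prompt tokens added per layer
EMOTIONS = ["neutral", "happy", "angry"]

# Frozen part: a stand-in for the pretrained emotion-agnostic backbone.
frozen_weights = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1
                  for _ in range(N_LAYERS)]

# Trainable part: one small prompt table per layer, indexed by emotion.
prompt_tables = [rng.standard_normal((len(EMOTIONS), N_PROMPTS, D_MODEL)) * 0.1
                 for _ in range(N_LAYERS)]


def frozen_layer(tokens, weight):
    """Toy token-mixing layer: each token sees the mean of all tokens."""
    context = tokens.mean(axis=0, keepdims=True)
    return np.tanh((tokens + context) @ weight)


def forward(audio_tokens, emotion):
    """Run the frozen backbone with per-layer emotion prompts prepended."""
    idx = EMOTIONS.index(emotion)
    hidden = audio_tokens
    for weight, table in zip(frozen_weights, prompt_tables):
        # Prepend this layer's prompts, apply the frozen layer,
        # then drop the prompt positions so the sequence length is unchanged.
        extended = np.concatenate([table[idx], hidden], axis=0)
        hidden = frozen_layer(extended, weight)[N_PROMPTS:]
    return hidden


audio_tokens = rng.standard_normal((5, D_MODEL))  # 5 made-up audio feature tokens
out_happy = forward(audio_tokens, "happy")
out_angry = forward(audio_tokens, "angry")

# Parameter counts for this toy setup only; they say nothing about the real model.
n_frozen = sum(w.size for w in frozen_weights)
n_trainable = sum(t.size for t in prompt_tables)
```

In this sketch only `prompt_tables` would be updated during adaptation, while `frozen_weights` stay fixed; changing the emotion label changes the output for the same audio tokens without touching the backbone.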
Taxonomy
Topics: Human Pose and Action Recognition · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
