TL;DR
AUHead is a two-stage method that enables fine-grained emotional control in talking-head video generation by disentangling and manipulating Action Units, resulting in more realistic and expressive virtual avatars.
Contribution
The paper introduces a novel two-stage approach combining large audio-language models and a controllable diffusion model for emotion-aware talking-head synthesis.
Findings
Achieves superior emotional realism and lip synchronization on benchmark datasets.
Effectively disentangles and controls Action Units for nuanced emotional expression.
Outperforms existing methods in visual coherence and identity preservation.
Abstract
Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs) through spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. This stage aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the…
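The first stage described in the abstract can be pictured with a small sketch. The excerpt does not give the tokenization scheme, so everything below is an assumption made for illustration: the AU subset, the four intensity bins, the `<t…>` and `<AU…>` token spellings, and the "Emotion: … / AUs: …" target layout are hypothetical and are not taken from the paper. The sketch only shows one plausible way per-frame AU intensities could become a token sequence that a language model emits after first naming an emotion.

```python
from typing import List, Sequence

# Hypothetical AU subset (FACS numbering); the paper's actual AU set is not
# stated in this summary.
AU_IDS = [1, 2, 4, 6, 12, 15]
NUM_LEVELS = 4  # assumed number of intensity bins, level 0 meaning "inactive"


def quantize(intensity: float, num_levels: int = NUM_LEVELS) -> int:
    """Map an intensity in [0, 1] to an integer bin in [0, num_levels - 1]."""
    clipped = min(max(intensity, 0.0), 1.0)
    return min(int(clipped * num_levels), num_levels - 1)


def tokenize_au_sequence(
    frames: Sequence[Sequence[float]], au_ids: Sequence[int] = AU_IDS
) -> List[str]:
    """Turn per-frame AU intensities into a flat token list.

    Each frame contributes a time token (temporal axis) followed by one token
    per active AU (spatial axis). Inactive AUs (level 0) are skipped.
    """
    tokens: List[str] = []
    for t, frame in enumerate(frames):
        if len(frame) != len(au_ids):
            raise ValueError("each frame needs one intensity per AU")
        tokens.append(f"<t{t}>")
        for au, value in zip(au_ids, frame):
            level = quantize(value)
            if level > 0:
                tokens.append(f"<AU{au}_{level}>")
    return tokens


def format_emotion_then_au(emotion: str, tokens: Sequence[str]) -> str:
    """Assumed 'emotion-then-AU' target: coarse emotion first, AU tokens second."""
    return f"Emotion: {emotion}\nAUs: {' '.join(tokens)}"


frames = [
    [0.0, 0.0, 0.0, 0.9, 1.0, 0.0],  # strong AU6 and AU12
    [0.3, 0.0, 0.0, 0.0, 0.0, 0.0],  # weak AU1
]
tokens = tokenize_au_sequence(frames)
target = format_emotion_then_au("happy", tokens)
print(target)
```

Placing the emotion label before the AU tokens mirrors the ordering the abstract names, where a coarse emotion is stated first and the fine-grained AUs follow. How the real system represents either part is not recoverable from this excerpt.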