EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis
Shuai Tan, Bin Ji

TL;DR
EDTalk++ introduces a comprehensive disentanglement framework for controllable talking head synthesis, enabling independent manipulation of facial features from diverse inputs, with improved control and realism.
Contribution
The paper presents a full disentanglement approach that combines four separate modules for facial features (mouth, pose, eye, and expression), orthogonality constraints, and an audio-to-motion module, advancing controllable talking head generation.
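As a rough illustration of what an orthogonality constraint between latent spaces can look like, the sketch below penalizes dot products between basis vectors drawn from four banks. It is a minimal sketch under stated assumptions, not the paper's loss: the bank sizes, the dimension `DIM`, the pairwise squared-dot-product penalty, and all function names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonality_penalty(banks):
    """Sum of squared dot products between every pair of distinct basis vectors."""
    vectors = [v for bank in banks for v in bank]
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += dot(vectors[i], vectors[j]) ** 2
    return total

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors."""
    ortho = []
    for v in vectors:
        w = list(v)
        for u in ortho:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        ortho.append([wi / norm for wi in w])
    return ortho

random.seed(0)
DIM = 16  # hypothetical latent dimension
SIZES = {"mouth": 3, "pose": 3, "eye": 2, "expression": 4}  # hypothetical bank sizes

def random_bank(k):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(k)]

# Unconstrained banks: basis vectors overlap, so the penalty is large.
random_banks = [random_bank(k) for k in SIZES.values()]

# Orthonormalize all vectors jointly, then split them back into four banks.
flat = gram_schmidt([v for bank in random_banks for v in bank])
orthogonal_banks, start = [], 0
for k in SIZES.values():
    orthogonal_banks.append(flat[start:start + k])
    start += k

penalty_random = orthogonality_penalty(random_banks)
penalty_orthogonal = orthogonality_penalty(orthogonal_banks)
```

In a training setup a term like this would typically be added to the overall objective so that gradient descent pushes the banks toward mutual orthogonality; here Gram-Schmidt stands in only to show the value the penalty takes once that state is reached.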
Findings
Effective disentanglement of facial features demonstrated
Independent control of mouth, pose, eye, and expression achieved
Enhanced realism and flexibility in talking head synthesis
Abstract
Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the applicability and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved and shared across inputs of different modalities, both aspects often neglected in existing methods. To address this gap, this paper proposes EDTalk++, a novel full disentanglement framework for controllable talking head generation. Our framework enables individual manipulation of mouth shape, head pose, eye movement, and emotional expression, conditioned on video or audio inputs. Specifically, we employ four lightweight modules to decompose the facial dynamics into four distinct latent spaces representing mouth, pose, eye, and expression,…
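The idea of factors that "operate independently without mutual interference" can be illustrated with a toy example. The sketch below assumes each factor is represented by a single direction in a 4-dimensional motion space and that these directions are mutually orthonormal; the `BASES` values, the `decompose` and `compose` helpers, and the input vectors are all hypothetical and are not taken from the paper.

```python
# Hypothetical orthonormal directions, one per factor
# (rows of a 4x4 Hadamard matrix scaled by 1/2).
BASES = {
    "mouth":      [0.5,  0.5,  0.5,  0.5],
    "pose":       [0.5, -0.5,  0.5, -0.5],
    "eye":        [0.5,  0.5, -0.5, -0.5],
    "expression": [0.5, -0.5, -0.5,  0.5],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(motion):
    """Project a motion vector onto each factor's direction."""
    return {name: dot(motion, basis) for name, basis in BASES.items()}

def compose(coeffs):
    """Rebuild a motion vector as a weighted sum of the factor directions."""
    out = [0.0] * 4
    for name, c in coeffs.items():
        out = [o + c * b for o, b in zip(out, BASES[name])]
    return out

source_motion = [1.0, 2.0, 3.0, 4.0]    # toy stand-in for the source's motion
driving_motion = [4.0, 0.0, -2.0, 2.0]  # toy stand-in for a driving input

src = decompose(source_motion)
drv = decompose(driving_motion)

# Take the mouth coefficient from the driving input, keep everything else.
mixed = dict(src, mouth=drv["mouth"])
mixed_motion = compose(mixed)

# Re-decomposing shows that only the mouth coefficient changed.
recovered = decompose(mixed_motion)
```

Because the directions are orthonormal, replacing one coefficient leaves the other three untouched when the result is decomposed again, which is the property the abstract asks of the decoupled spaces.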
