Taming Transformer for Emotion-Controllable Talking Face Generation
Ziqi Zhang and Cheng Deng

TL;DR
This paper introduces a novel transformer-based method for emotion-controllable talking face generation, effectively modeling multimodal relationships to synthesize realistic, emotion-specific talking videos while preserving identity.
Contribution
The paper proposes a new approach using pre-training, visual token quantization, and an emotion-anchor representation within an autoregressive transformer for emotion-controllable face synthesis.
Findings
Outperforms existing methods qualitatively and quantitatively
Successfully models multiple emotional states conditioned on audio
Achieves high-quality, identity-preserving talking face videos
Abstract
Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: one is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity-preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to…
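The abstract's step of quantizing videos into visual tokens can be illustrated with a generic nearest-neighbor codebook lookup, as used in VQ-style models. This is only a minimal sketch under assumptions: the summary does not describe the paper's actual quantizer, and the function name `quantize`, the toy `codebook`, and the 2-D `features` are hypothetical values chosen for illustration.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (hypothetical helper).

    features: (N, D) array of continuous visual features.
    codebook: (K, D) array of learned code vectors (assumed given here).
    Returns the (N,) token indices and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code: shape (N, K)
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Each feature becomes the index of its closest code, i.e. a discrete visual token
    token_ids = np.argmin(dists, axis=1)
    return token_ids, codebook[token_ids]

# Toy data (assumption): three codes and three features in 2-D
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])

token_ids, quantized = quantize(features, codebook)
print(token_ids)
```

A sequence of such discrete indices is the kind of input an autoregressive transformer can predict token by token. How the paper conditions that prediction on audio and emotion is not shown here.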
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
