Taming Transformer for Emotion-Controllable Talking Face Generation

Ziqi Zhang and Cheng Deng

arXiv:2508.14359 · cs.CV · August 21, 2025

Open Access

TL;DR

This paper introduces a novel transformer-based method for emotion-controllable talking face generation, effectively modeling multimodal relationships to synthesize realistic, emotion-specific talking videos while preserving identity.

Contribution

The paper proposes a new approach using pre-training, visual token quantization, and an emotion-anchor representation within an autoregressive transformer for emotion-controllable face synthesis.
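The page does not describe how the visual token quantization is implemented. As a rough illustration of the general idea only, the sketch below uses plain nearest-neighbor vector quantization: each feature vector is replaced by the index of its closest codebook entry. The `quantize` function, the 2-D codebook, and the feature values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Generic nearest-neighbor vector quantization, used here as an
    assumption; the paper's actual quantizer may differ.
    """
    tokens = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy codebook with three 2-D entries (illustrative values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical per-patch visual features to be turned into discrete tokens.
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

tokens = quantize(features, codebook)
print(tokens)
```

Once frames are expressed as sequences of discrete indices like these, a sequence model such as an autoregressive transformer can predict them one token at a time.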

Findings

01. Outperforms existing methods qualitatively and quantitatively

02. Successfully models multiple emotional states conditioned on audio

03. Achieves high-quality, identity-preserving talking face videos

Abstract

Talking face generation is a novel and challenging generation task that aims to synthesize a vivid speaking-face video from a given audio clip. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: one is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity-preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis