Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang

TL;DR
This paper introduces a method for arbitrary-subject talking face generation that learns a disentangled audio-visual representation, enabling realistic synthesis and improving performance on lip reading and audio-video retrieval.
Contribution
It proposes a new associative-and-adversarial training framework to explicitly disentangle subject and speech information in audio-visual data.
Findings
Generates realistic talking face sequences for arbitrary subjects.
Produces clearer lip motion patterns than previous methods.
Enhances performance in lip reading and audio-video retrieval tasks.
Abstract
Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct a specific face appearance model for specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning a disentangled audio-visual representation. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has an advantage where both audio and video can serve as inputs…
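The adversarial half of the disentanglement described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes one common formulation, in which an identity classifier reads the speech-content embedding and the speech encoder is trained to push that classifier's output toward a uniform distribution. The function names, logits, and the number of identities below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_loss(identity_logits, true_identity):
    """Cross-entropy that the (assumed) identity classifier minimizes.

    It is low when the speech embedding still reveals who is speaking.
    """
    probs = softmax(identity_logits)
    return -math.log(probs[true_identity])


def confusion_loss(identity_logits):
    """Cross-entropy against a uniform target over identities.

    The (assumed) speech encoder minimizes this, so that its embedding
    gives the classifier no information about the subject.
    """
    probs = softmax(identity_logits)
    return -sum(math.log(p) for p in probs) / len(probs)


# Hypothetical classifier logits over 4 identities.
leaky = [5.0, 0.0, 0.0, 0.0]   # embedding still encodes the subject
clean = [0.0, 0.0, 0.0, 0.0]   # embedding carries no subject cue

print(classifier_loss(leaky, 0), confusion_loss(leaky))
print(classifier_loss(clean, 0), confusion_loss(clean))
```

Under this assumption, the confusion loss reaches its minimum, log(N) for N identities, exactly when the classifier's prediction is uniform, which is the state the encoder is driven toward. The two losses are optimized in alternation: the classifier on `classifier_loss`, the encoder on `confusion_loss`.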
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing
