Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu

TL;DR
This paper introduces a framework for generating pose-controllable talking faces from a single identity image. It handles pose control, lip synchronization, and robustness to extreme views without relying on structural estimates such as landmarks or 3D parameters.
Contribution
The proposed method models audio-visual representations with an implicit pose code, enabling accurate pose control and lip synchronization directly from raw images, surpassing previous landmark-based approaches.
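The modular idea can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the function names (`make_code`, `compose`, `split`), and the use of plain concatenation are all assumptions made for illustration, and pseudo-random vectors stand in for the outputs of learned encoders. It only shows the property the summary describes: with identity and speech content held fixed, swapping a low-dimensional pose code changes the pose part of the representation and nothing else.

```python
import random

# All dimensions and names here are hypothetical, chosen only for illustration.
ID_DIM, CONTENT_DIM, POSE_DIM = 8, 6, 3

def make_code(dim, seed):
    """Stand-in for an encoder output: a deterministic pseudo-random vector."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def compose(identity, content, pose):
    """Concatenate the three modular codes into one generator input (assumed layout)."""
    return identity + content + pose

def split(feature):
    """Recover the identity, speech content, and pose codes from a composed feature."""
    identity = feature[:ID_DIM]
    content = feature[ID_DIM:ID_DIM + CONTENT_DIM]
    pose = feature[ID_DIM + CONTENT_DIM:]
    return identity, content, pose

identity = make_code(ID_DIM, seed=0)      # would come from the single reference photo
content = make_code(CONTENT_DIM, seed=1)  # would come from the driving audio
pose_a = make_code(POSE_DIM, seed=2)      # pose code from one pose source
pose_b = make_code(POSE_DIM, seed=3)      # pose code from a different pose source

feat_a = compose(identity, content, pose_a)
feat_b = compose(identity, content, pose_b)

# Swapping the pose code leaves the identity and speech content slices untouched.
id_a, content_a, p_a = split(feat_a)
id_b, content_b, p_b = split(feat_b)
print(id_a == id_b, content_a == content_b, p_a == p_b)
```

In a real system the three codes would be produced by trained networks and consumed by an image generator; the sketch only makes the separation of the three factors concrete.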
Findings
Accurately lip-synced talking faces with controllable poses.
Robustness to extreme viewing angles.
Effective frontalization of talking faces.
Abstract
While accurate lip synchronization has been achieved for arbitrary-subject audio-driven talking face generation, the problem of how to efficiently drive the head pose remains. Previous methods rely on pre-estimated structural information such as landmarks and 3D parameters, aiming to generate personalized rhythmic movements. However, the inaccuracy of such estimated information under extreme conditions would lead to degradation problems. In this paper, we propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. The key is to modularize audio-visual representations by devising an implicit low-dimension pose code. Substantially, both speech content and head pose information lie in a joint non-identity embedding space. While speech content information can be defined by learning…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
