Everybody's Talkin': Let Me Talk as You Want
Linsen Song, Wayne Wu, Chen Qian, Ran He, Chen Change Loy

TL;DR
This paper introduces a method for editing portrait videos that translates source audio into realistic facial movements while preserving the target's original geometry and pose and keeping the result temporally coherent.
Contribution
It factorizes each target video frame into expression, geometry, and pose parameters via monocular 3D face reconstruction, then uses a recurrent network to map source audio to expression parameters, without assuming a person-specific rendering network.
Findings
Achieves high realism in talking portrait videos
Maintains original video context and pose
Robust to variations in source audio
Abstract
We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the…
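The factorization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `FaceParams`, `TinyRecurrentMapper`, and `retarget_expressions` are hypothetical names, the recurrent cell is a toy Elman-style unit with hand-picked weights, and the 3D reconstruction and rendering stages are omitted entirely. It only shows the core idea that audio drives the expression parameters while geometry and pose are carried over from the target frames.

```python
import math
from dataclasses import dataclass, replace
from typing import List, Sequence

Vector = List[float]
Matrix = List[Vector]


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame container for the three factorized parameter groups."""
    expression: Vector
    geometry: Vector
    pose: Vector


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class TinyRecurrentMapper:
    """Toy Elman-style recurrent cell standing in for an audio-to-expression network."""

    def __init__(self, w_in: Matrix, w_rec: Matrix, w_out: Matrix) -> None:
        self.w_in, self.w_rec, self.w_out = w_in, w_rec, w_out

    def __call__(self, audio_features: Sequence[Vector]) -> List[Vector]:
        hidden = [0.0] * len(self.w_rec)
        outputs = []
        for x in audio_features:
            # The hidden state mixes the current audio feature with the previous state.
            pre = [a + b for a, b in zip(matvec(self.w_in, x), matvec(self.w_rec, hidden))]
            hidden = [math.tanh(p) for p in pre]
            outputs.append(matvec(self.w_out, hidden))
        return outputs


def retarget_expressions(
    target_frames: Sequence[FaceParams],
    audio_features: Sequence[Vector],
    mapper: TinyRecurrentMapper,
) -> List[FaceParams]:
    """Replace only the expression parameters; geometry and pose stay untouched."""
    if len(target_frames) != len(audio_features):
        raise ValueError("expected one audio feature vector per target frame")
    predicted = mapper(audio_features)
    return [replace(f, expression=e) for f, e in zip(target_frames, predicted)]


# Usage with made-up numbers: three target frames and one audio feature per frame.
frames = [FaceParams([0.0, 0.0], [1.0], [0.1, 0.2, 0.3]) for _ in range(3)]
audio = [[0.5], [-0.5], [0.0]]
mapper = TinyRecurrentMapper(
    w_in=[[1.0], [0.5]],
    w_rec=[[0.1, 0.0], [0.0, 0.1]],
    w_out=[[1.0, 0.0], [0.0, 1.0]],
)
edited = retarget_expressions(frames, audio, mapper)
```

Because the mapper is recurrent, the expression predicted for the last frame is nonzero even though its audio feature is zero: the hidden state carries information forward from earlier frames.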
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Face recognition and analysis
