TL;DR
This paper introduces an unsupervised framework for synthesizing realistic talking face videos. It imitates arbitrary talking styles from reference videos by learning style codes from 3D morphable model (3DMM) parameters.
Contribution
It proposes a novel style injection method that learns talking styles from videos without supervision, and that can imitate and interpolate those styles for more expressive talking face synthesis.
Findings
The framework can imitate styles without explicit annotations.
It produces more natural and expressive talking face videos.
Style interpolation enables new style generation.
Abstract
People talk with diversified styles. For one piece of speech, different talking styles exhibit significant differences in facial and head pose movements. For example, the "excited" style usually talks with the mouth wide open, while the "solemn" style is more standardized and seldom exhibits exaggerated motions. Due to such huge differences between styles, it is necessary to incorporate the talking style into the audio-driven talking face synthesis framework. In this paper, we propose to inject style into the talking face synthesis framework by imitating the arbitrary talking style of a particular reference video. Specifically, we systematically investigate talking styles with our collected Ted-HD dataset and construct style codes as several statistics of 3D morphable model (3DMM) parameters. Afterwards, we devise a latent-style-fusion (LSF) model to synthesize…
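The abstract describes style codes as "several statistics" of 3DMM parameters, and the findings mention style interpolation. The sketch below illustrates that idea only. Which statistics the paper actually uses is not stated here, so the choice of per-dimension mean, standard deviation, and standard deviation of frame-to-frame differences is an assumption, as are the function names `style_code` and `interpolate_styles` and the use of plain linear interpolation.

```python
import numpy as np


def style_code(params: np.ndarray) -> np.ndarray:
    """Summarize a clip's per-frame 3DMM parameters as a fixed-length vector.

    params: array of shape (T, D), one row of expression/pose parameters per frame.
    The three statistics below are assumed for illustration, not taken from the paper.
    """
    if params.ndim != 2 or params.shape[0] < 2:
        raise ValueError("expected a (T, D) array with at least 2 frames")
    mean = params.mean(axis=0)                        # average configuration
    std = params.std(axis=0)                          # how widely each parameter varies
    delta_std = np.diff(params, axis=0).std(axis=0)   # how quickly it changes between frames
    return np.concatenate([mean, std, delta_std])     # shape (3 * D,)


def interpolate_styles(code_a: np.ndarray, code_b: np.ndarray, alpha: float) -> np.ndarray:
    """Blend two style codes linearly (assumed scheme); alpha=0 gives code_a, alpha=1 gives code_b."""
    if code_a.shape != code_b.shape:
        raise ValueError("style codes must have the same shape")
    return (1.0 - alpha) * code_a + alpha * code_b


# Synthetic stand-ins for two reference clips with 6 parameters per frame.
rng = np.random.default_rng(0)
excited = style_code(rng.normal(scale=2.0, size=(50, 6)))
solemn = style_code(rng.normal(scale=0.5, size=(80, 6)))
blended = interpolate_styles(excited, solemn, 0.5)
```

Because the code is built from statistics over time, clips of different lengths map to vectors of the same size, which is what makes blending two reference styles straightforward in this sketch.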
