StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles
Yifeng Ma, Suzhen Wang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Zhidong Deng, Xin Yu

TL;DR
StyleTalk introduces a novel framework for one-shot talking head generation that allows controllable speaking styles by extracting style from reference videos and integrating it into synthesized videos using a style-aware transformer.
Contribution
The paper proposes a style encoder, style-controllable decoder, and style-aware transformer to enable diverse, style-controllable talking head synthesis from a single image and audio.
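To make the data flow concrete, here is a minimal sketch of the encode-then-decode idea. It is a toy under stated assumptions, not the paper's model. The function names (`encode_style`, `decode_frame`, `animate`) are hypothetical. Mean pooling and element-wise addition stand in for the learned style encoder and style-controllable decoder. The style-aware transformer is not modeled at all.

```python
from typing import List

Vector = List[float]


def encode_style(motion_frames: List[Vector]) -> Vector:
    """Toy style encoder (assumption): average the per-frame facial motion
    parameters of a reference clip into one fixed-size style code."""
    if not motion_frames:
        raise ValueError("need at least one reference frame")
    dim = len(motion_frames[0])
    count = len(motion_frames)
    return [sum(frame[i] for frame in motion_frames) / count for i in range(dim)]


def decode_frame(audio_feature: Vector, style_code: Vector) -> Vector:
    """Toy style-controllable decoder (assumption): the speech content
    drives the motion and the style code shifts it."""
    if len(audio_feature) != len(style_code):
        raise ValueError("audio feature and style code must have equal length")
    return [a + s for a, s in zip(audio_feature, style_code)]


def animate(audio_features: List[Vector], style_code: Vector) -> List[Vector]:
    """Apply the same style code to every audio frame."""
    return [decode_frame(feature, style_code) for feature in audio_features]


# Hypothetical inputs: two reference frames and two audio frames.
reference_motion = [[0.0, 2.0], [2.0, 4.0]]
audio = [[0.5, 0.5], [1.0, 0.0]]

style = encode_style(reference_motion)
stylized_motion = animate(audio, style)
```

The point of the sketch is the separation of inputs: the style code is computed once from the reference video and reused for every frame, while the audio features change per frame. Swapping the reference clip changes the style without touching the speech content.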
Findings
Capable of generating diverse speaking styles from one portrait and audio.
Achieves realistic and authentic visual effects in generated videos.
Outperforms existing methods in style controllability and visual quality.
Abstract
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Motion and Animation
Methods: Contrastive Language-Image Pre-training
