Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses
Miao Liao, Sibo Zhang, Peng Wang, Hao Zhu, Xinxin Zuo, and Ruigang Yang

TL;DR
This paper introduces a novel speech-to-video synthesis method that combines 3D skeleton regularization, expressive body poses, and part attention mechanisms in a GAN to produce realistic, synchronized talking videos with rich body dynamics.
Contribution
The paper presents a new framework integrating 3D skeleton regularization, gesture dictionaries, and part attention in GANs for improved speech-driven video synthesis.
Findings
Outperforms previous state-of-the-art methods in user studies
Generates realistic and expressive body movements from speech
Efficiently learns meaningful gestures with limited data
Abstract
In this paper, we propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, and expressive body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process in both learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the latter helps our model quickly learn meaningful body movement through a few recorded videos. To produce photo-realistic and high-resolution…
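To illustrate how knowledge of an articulated skeleton can rule out unreasonable body distortion, here is a minimal sketch of one common form of skeleton constraint: rescaling each predicted bone to a fixed length while keeping its direction. This is not taken from the paper. The joint chain, the bone lengths, and the function name `enforce_bone_lengths` are all hypothetical, and the authors' actual regularization may differ.

```python
import numpy as np

# Hypothetical 5-joint chain: each entry is the parent index (-1 marks the root).
# Assumption: every parent index is smaller than its child's index.
PARENTS = [-1, 0, 1, 2, 3]
# Hypothetical fixed bone lengths (arbitrary units); entry 0 is unused (root).
BONE_LENGTHS = np.array([0.0, 0.5, 0.3, 0.3, 0.25])


def enforce_bone_lengths(joints, parents=PARENTS, lengths=BONE_LENGTHS):
    """Rescale every bone of a predicted 3D pose to its fixed length.

    joints: array of shape (num_joints, 3) with predicted joint positions.
    Returns a new array where each child joint sits at the fixed bone
    length from its (already corrected) parent, along the predicted direction.
    """
    joints = np.asarray(joints, dtype=float)
    fixed = joints.copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root keeps its predicted position
        direction = joints[j] - joints[p]
        norm = np.linalg.norm(direction)
        if norm < 1e-8:
            # Degenerate prediction: fall back to an arbitrary unit direction.
            direction, norm = np.array([0.0, 1.0, 0.0]), 1.0
        fixed[j] = fixed[p] + direction / norm * lengths[j]
    return fixed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy_pose = rng.normal(size=(5, 3))  # stand-in for a network's output
    clean_pose = enforce_bone_lengths(noisy_pose)
    for j, p in enumerate(PARENTS):
        if p >= 0:
            print(j, round(float(np.linalg.norm(clean_pose[j] - clean_pose[p])), 3))
```

A projection like this guarantees constant limb lengths across frames, which is one simple way a skeleton prior can keep generated motion anatomically plausible before the frames are rendered.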
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
