Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses

Miao Liao, Sibo Zhang, Peng Wang, Hao Zhu, Xinxin Zuo, and Ruigang Yang

arXiv:2007.09198 · cs.CV · October 12, 2020 · 5 cites

TL;DR

This paper introduces a novel speech-to-video synthesis method that combines 3D skeleton regularization, expressive body poses, and part attention mechanisms in a GAN to produce realistic, synchronized talking videos with rich body dynamics.

Contribution

The paper presents a new framework integrating 3D skeleton regularization, gesture dictionaries, and part attention in GANs for improved speech-driven video synthesis.

Findings

01 Outperforms previous state-of-the-art methods in user studies

02 Generates realistic and expressive body movements from speech

03 Efficiently learns meaningful gestures with limited data

Abstract

In this paper, we propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, and expressive body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process in both learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the latter helps our model quickly learn meaningful body movement through a few recorded videos. To produce photo-realistic and high-resolution…
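As a rough illustration of why an articulated skeleton can rule out body distortion, the sketch below drives a toy planar arm with joint angles only, keeping bone lengths as fixed constants. This is an assumption made for illustration, not the paper's actual parameterization: the names `BONE_LENGTHS`, `forward_kinematics`, and `bone_lengths`, the 2D setting, and the hard-coded angles standing in for per-frame RNN output are all hypothetical.

```python
import math

# Hypothetical planar arm: shoulder -> elbow -> wrist.
# Bone lengths belong to the skeleton; the predictor never outputs them.
BONE_LENGTHS = [0.30, 0.25]


def forward_kinematics(angles, root=(0.0, 0.0)):
    """Convert per-joint relative angles (radians) into joint positions."""
    x, y = root
    heading = 0.0
    joints = [(x, y)]
    for length, angle in zip(BONE_LENGTHS, angles):
        heading += angle  # each angle is relative to the parent bone
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        joints.append((x, y))
    return joints


def bone_lengths(joints):
    """Measure the distance between each pair of consecutive joints."""
    return [math.dist(a, b) for a, b in zip(joints, joints[1:])]


# Stand-in for per-frame network output: arbitrary, even extreme, angles.
predicted_angles = [[0.1, 0.5], [1.2, -0.7], [-2.0, 3.0]]

for angles in predicted_angles:
    joints = forward_kinematics(angles)
    print([round(length, 3) for length in bone_lengths(joints)])
```

Whatever angles the predictor emits, the measured bone lengths match the skeleton's constants, so limbs cannot stretch or shrink. Predicting raw joint coordinates directly would carry no such guarantee.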

Code & Models

Repositories

sibozhang/Speech2Video (Official)

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation