Text2Video: Text-driven Talking-head Video Synthesis with Personalized Phoneme-Pose Dictionary

Sibo Zhang; Jiahong Yuan; Miao Liao; Liangjun Zhang

arXiv:2104.14631·cs.CV·January 25, 2022·1 cites

TL;DR

This paper introduces a novel text-to-video synthesis method that uses a phoneme-pose dictionary and GANs, requiring less data and offering greater flexibility compared to audio-driven approaches.

Contribution

It proposes a new text-driven video synthesis framework with a phoneme-pose dictionary and GAN, reducing data needs and improving robustness over existing audio-based methods.

Findings

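01 Outperforms state-of-the-art talking-face generation methods on a benchmark dataset and the authors' own datasets

02 Requires less training data and less preprocessing than audio-driven approaches

03 Is more flexible and more robust to speaker variation than audio-driven approaches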

Abstract

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from the text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It only needs a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the…
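
To make the "interpolated phoneme poses" step concrete, here is a minimal Python sketch of one way a phoneme-pose dictionary lookup followed by interpolation could work. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which interpolation scheme is used, so plain linear interpolation is assumed here, and the pose representation (a flat list of key-point coordinates), the phoneme labels, and the function name `interpolate_poses` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical pose representation: a flat list of key-point coordinates.
Pose = List[float]


def interpolate_poses(
    timed_phonemes: List[Tuple[str, float]],
    dictionary: Dict[str, Pose],
    fps: float,
) -> List[Pose]:
    """Look up a key pose per phoneme, then fill in per-frame poses.

    timed_phonemes: (phoneme, timestamp in seconds) pairs, sorted by time.
    dictionary:     phoneme -> key pose (assumed to be built per speaker).
    fps:            frame rate of the pose sequence to produce.

    Assumption: linear interpolation between neighbouring key poses.
    """
    keyframes = [(t, dictionary[p]) for p, t in timed_phonemes]
    if not keyframes:
        return []
    if len(keyframes) == 1:
        return [list(keyframes[0][1])]

    start, end = keyframes[0][0], keyframes[-1][0]
    n_frames = int(round((end - start) * fps)) + 1

    frames: List[Pose] = []
    seg = 0  # index of the key pose that opens the current segment
    for i in range(n_frames):
        t = start + i / fps
        # Advance to the segment that contains time t.
        while seg < len(keyframes) - 2 and t > keyframes[seg + 1][0]:
            seg += 1
        (t0, p0), (t1, p1) = keyframes[seg], keyframes[seg + 1]
        w = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
        w = min(max(w, 0.0), 1.0)  # clamp against rounding at the ends
        frames.append([a + w * (b - a) for a, b in zip(p0, p1)])
    return frames


# Toy dictionary with two made-up 2-D poses for two phoneme labels.
toy_dictionary = {"AA": [0.0, 0.0], "M": [1.0, 2.0]}
toy_frames = interpolate_poses([("AA", 0.0), ("M", 1.0)], toy_dictionary, fps=4)
```

In a pipeline of the kind the abstract describes, a pose sequence like `toy_frames` would then be passed to the trained GAN to render video frames; that rendering stage is outside the scope of this sketch.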

Code & Models

Repositories

sibozhang/Text2Video

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Face recognition and analysis