Text2Video: Text-driven Talking-head Video Synthesis with Personalized Phoneme-Pose Dictionary
Sibo Zhang, Jiahong Yuan, Miao Liao, Liangjun Zhang

TL;DR
This paper introduces a text-to-video synthesis method that builds a phoneme-pose dictionary and trains a GAN to generate video from interpolated phoneme poses. It requires less training data and is more flexible than audio-driven approaches.
Contribution
It proposes a new text-driven video synthesis framework with a phoneme-pose dictionary and GAN, reducing data needs and improving robustness over existing audio-based methods.
Findings
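Outperforms state-of-the-art talking-face generation methods on a benchmark dataset and on the authors' own datasets
Needs only a fraction of the training data used by audio-driven approaches, and reduces preprocessing, training, and inference time
Is more flexible than audio-driven methods and is not vulnerable to speaker variation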
Abstract
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from the text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It only needs a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the…
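Example
The core idea in the abstract, looking up a key pose per phoneme and interpolating between key poses to get one pose per video frame, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary contents, the two-value pose vectors, the choice of placing each key pose at the phoneme's temporal midpoint, and the use of linear interpolation are all hypothetical. The GAN that renders frames from the interpolated poses is not shown.

```python
from bisect import bisect_right

# Hypothetical toy dictionary: phoneme -> pose vector
# (for instance, flattened mouth keypoints). Values are made up.
PHONEME_POSES = {
    "sil": [0.0, 0.0],
    "HH": [0.2, 0.1],
    "AY": [0.8, 0.6],
}


def interpolate_poses(timed_phonemes, fps=25):
    """Return one pose per video frame.

    timed_phonemes: list of (phoneme, start_sec, end_sec), in time order.
    Assumption: each key pose sits at the midpoint of its phoneme, and
    poses between key times are linearly interpolated.
    """
    key_times = [(start + end) / 2.0 for _, start, end in timed_phonemes]
    key_poses = [PHONEME_POSES[p] for p, _, _ in timed_phonemes]

    duration = timed_phonemes[-1][2]
    n_frames = int(round(duration * fps))

    frames = []
    for i in range(n_frames):
        t = i / fps
        j = bisect_right(key_times, t)  # index of first key time after t
        if j == 0:
            # Before the first key pose: hold it.
            frames.append(list(key_poses[0]))
        elif j == len(key_times):
            # After the last key pose: hold it.
            frames.append(list(key_poses[-1]))
        else:
            t0, t1 = key_times[j - 1], key_times[j]
            w = (t - t0) / (t1 - t0)
            frames.append(
                [(1 - w) * a + w * b for a, b in zip(key_poses[j - 1], key_poses[j])]
            )
    return frames


# Usage: phoneme timings would come from a text-to-speech or alignment step.
timed = [("sil", 0.0, 0.2), ("HH", 0.2, 0.4), ("AY", 0.4, 0.8)]
frames = interpolate_poses(timed, fps=10)
print(len(frames), frames[0], frames[-1])
```

In a full pipeline, each interpolated pose would be passed to the trained generator to produce the corresponding video frame.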
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Face recognition and analysis
