Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation

Zhichao Wang; Mengyu Dai; Keld Lundgaard

arXiv:2308.06457 · cs.CV · August 15, 2023 · 2 cites

TL;DR

This paper introduces a two-stage zero-shot framework for identity-agnostic talking-head video generation, combining text-to-speech conversion with audio-driven video synthesis to enable flexible, person-independent video creation from text.

Contribution

It proposes a novel two-stage framework that integrates pretrained zero-shot models for text-to-speech and talking head generation, enabling identity-agnostic video synthesis.

Findings

01

Compared different TTS and talking head methods to identify the best approach.

02

Demonstrated the effectiveness of the two-stage framework with sample videos.

03

Provided a public repository for samples and further research.

Abstract

The advent of ChatGPT has introduced innovative methods for information gathering and analysis. However, the information provided by ChatGPT is limited to text, and the visualization of this information remains constrained. Previous research has explored zero-shot text-to-video (TTV) approaches to transform text into videos. However, these methods lacked control over the identity of the generated audio, i.e., they were not identity-agnostic, which hindered their effectiveness. To address this limitation, we propose a novel two-stage framework for person-agnostic video cloning, specifically focusing on TTV generation. In the first stage, we leverage pretrained zero-shot models to achieve text-to-speech (TTS) conversion. In the second stage, an audio-driven talking head generation method is employed to produce compelling videos given the audio generated in the first stage. This paper presents a…
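The two stages described in the abstract can be pictured as a simple composition of two models. The following is only a minimal sketch, not the authors' implementation: the class `TwoStagePipeline`, the `tts` and `talking_head` callables, the byte-string data types, and the reference voice clip and reference face image inputs are all hypothetical names and assumptions introduced for illustration. The stub functions stand in for real pretrained models.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholders: raw bytes stand in for real audio, image, and video data.
Audio = bytes
Image = bytes
Video = bytes


@dataclass
class TwoStagePipeline:
    # Stage 1 (assumed interface): text plus a reference voice clip -> speech audio.
    tts: Callable[[str, Audio], Audio]
    # Stage 2 (assumed interface): speech audio plus a reference face image -> video.
    talking_head: Callable[[Audio, Image], Video]

    def generate(self, text: str, voice_ref: Audio, face_ref: Image) -> Video:
        # Stage 1: convert the text to speech.
        speech = self.tts(text, voice_ref)
        # Stage 2: drive the talking head with the audio produced in stage 1.
        return self.talking_head(speech, face_ref)


# Stub models for illustration only; real pretrained models would replace these.
def fake_tts(text: str, voice_ref: Audio) -> Audio:
    return voice_ref + text.encode("utf-8")


def fake_talking_head(speech: Audio, face_ref: Image) -> Video:
    return face_ref + b"|" + speech


pipeline = TwoStagePipeline(tts=fake_tts, talking_head=fake_talking_head)
video = pipeline.generate("hello", voice_ref=b"VOICE:", face_ref=b"FACE")
```

Because the stages communicate only through the intermediate audio, either model could in principle be swapped without touching the other, which fits the paper's comparison of different TTS and talking-head methods.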

Code & Models

Repositories

zhichaowang970201/text-to-video (Official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing