Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

Jeongsoo Choi; Minsu Kim; Se Jin Park; Yong Man Ro

arXiv:2306.16003 · cs.GR · January 19, 2024

TL;DR

This paper introduces a novel method to reprogram pre-trained audio-driven talking face models to generate face videos from text inputs, eliminating the need for speech recordings during inference.

Contribution

It proposes a Text-to-Audio Embedding Module (TAEM) that maps text into the audio latent space, incorporating speaker characteristics from a single face image, enabling flexible text-driven face synthesis.

Findings

1. Effective text-to-face video generation demonstrated
2. Compatible with various pre-trained audio-driven models
3. High-quality face videos from text inputs achieved

Abstract

In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the necessity of recording speech for each inference, as required in the audio-driven model. To this end, we propose to embed the input text into the learned audio latent space of the pre-trained audio-driven model, while preserving the face synthesis capability of the original pre-trained model. Specifically, we devise a Text-to-Audio Embedding Module (TAEM) which maps a given text input into the audio latent space by modeling pronunciation and duration characteristics. Furthermore, to consider the speaker characteristics in audio while using text inputs, TAEM is designed to accept a visual speaker embedding. The visual speaker embedding…
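
As a rough illustration of the idea in the abstract, the sketch below maps a token sequence to frame-level vectors, conditions them on a speaker vector, and repeats each token's vector for a number of frames. It is not the authors' TAEM: the function names, the latent width `DIM`, the seeded random stand-in embeddings, and the fixed duration rule are all assumptions made for illustration, and a trained module would learn these quantities from data.

```python
import random

DIM = 4  # hypothetical latent width; the real audio latent size depends on the model


def token_embedding(token, dim=DIM):
    """Toy stand-in for a learned pronunciation embedding (deterministic per token)."""
    rng = random.Random(token)  # seeding with the token keeps the vector reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_duration(token):
    """Hypothetical duration rule (frames per token); a real module would predict this."""
    return 1 + len(token) % 3


def text_to_audio_latents(tokens, speaker_embedding):
    """Return one latent vector per output frame for the given token sequence."""
    frames = []
    for token in tokens:
        # Condition the token vector on the speaker by simple addition (an assumption).
        vec = [t + s for t, s in zip(token_embedding(token), speaker_embedding)]
        # Repeat the vector so the sequence length follows the token's duration.
        frames.extend(list(vec) for _ in range(toy_duration(token)))
    return frames


speaker = [0.1] * DIM  # placeholder for a speaker embedding derived from a face image
latents = text_to_audio_latents(["HH", "AH", "L", "OW"], speaker)
print(len(latents), len(latents[0]))  # 11 4
```

In this toy setup the frame-level sequence would stand in for the audio features that a frozen audio-driven generator normally consumes, which is why the generator itself would not need retraining.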

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis