AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

Xinsheng Wang; Qicong Xie; Jihua Zhu; Lei Xie; Scharenborg

arXiv:2108.04325·cs.CV·August 29, 2021

TL;DR

This paper introduces AnyoneNet, a method that generates synchronized speech and talking-head videos for an arbitrary person from text and a single face image, combining face-conditioned TTS with landmark-based head movement prediction.

Contribution

It presents a novel face-conditioned multi-speaker TTS model and a landmark-based head movement prediction method for arbitrary person video synthesis.

Findings

01. Generates synchronized speech and talking-head videos for an arbitrary person.

02. The timbre of the synthesized speech matches the appearance of the input face.

03. Outperforms state-of-the-art landmark-based methods.

Abstract

Automatically generating videos in which synthesized speech is synchronized with the lip movements of a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method that generates synchronized speech and talking-head videos from text and a single face image of an arbitrary person. In contrast to previous text-driven talking-head generation methods, which can only synthesize the voice of a specific person, the proposed method can synthesize speech for any person, including people not seen during training. Specifically, the proposed method decomposes the generation of synchronized speech and talking-head videos into two stages: a text-to-speech (TTS) stage and a speech-driven talking-head generation stage. The proposed TTS module is a face-conditioned multi-speaker TTS model that gets the…
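The two-stage decomposition described in the abstract can be illustrated as a data-flow sketch. This is not the authors' implementation: every function below is a hypothetical stand-in that returns dummy values, and the sample rate, frame rate, and the idea of a two-number "voice embedding" are assumptions made only so the example runs. It shows only how text and a face image could pass through a face-conditioned TTS stage and then a speech-driven landmark and rendering stage, with the frame count derived from the audio length so the two outputs stay aligned.

```python
from dataclasses import dataclass
from typing import List, Tuple

SAMPLE_RATE = 16000  # assumed audio sample rate in Hz (not from the paper)
FPS = 25             # assumed video frame rate (not from the paper)

Image = List[List[float]]


@dataclass
class Speech:
    samples: List[float]
    sample_rate: int

    @property
    def duration(self) -> float:
        """Length of the audio in seconds."""
        return len(self.samples) / self.sample_rate


def face_to_voice_embedding(face: Image) -> List[float]:
    """Hypothetical face encoder: summarizes the image as two numbers."""
    pixels = [p for row in face for p in row]
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


def text_to_speech(text: str, voice: List[float]) -> Speech:
    """Stage 1 stand-in: emits 0.1 s of dummy audio per input character."""
    n_samples = len(text) * SAMPLE_RATE // 10
    return Speech(samples=[voice[0]] * n_samples, sample_rate=SAMPLE_RATE)


def speech_to_landmarks(speech: Speech, fps: int = FPS) -> List[List[float]]:
    """Stage 2a stand-in: one dummy landmark vector per video frame."""
    n_frames = len(speech.samples) * fps // speech.sample_rate
    hop = speech.sample_rate // fps  # audio samples covered by one frame
    return [[speech.samples[i * hop]] for i in range(n_frames)]


def render_frames(face: Image, landmarks: List[List[float]]) -> List[Tuple[Image, List[float]]]:
    """Stage 2b stand-in: pairs the face image with each landmark vector."""
    return [(face, lm) for lm in landmarks]


def generate(text: str, face: Image) -> Tuple[Speech, list]:
    """Runs both stages in order: text + face -> speech -> landmarks -> frames."""
    voice = face_to_voice_embedding(face)
    speech = text_to_speech(text, voice)
    landmarks = speech_to_landmarks(speech)
    return speech, render_frames(face, landmarks)


if __name__ == "__main__":
    speech, frames = generate("hi there", [[0.1, 0.2], [0.3, 0.4]])
    print(f"{speech.duration:.2f} s of audio, {len(frames)} video frames")
```

Because the frame count is computed from the number of audio samples, the video length in this sketch tracks the speech length to within one frame, which is the property a synchronized pipeline needs regardless of how each stage is actually modeled.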

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis