AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person
Xinsheng Wang, Qicong Xie, Jihua Zhu, Lei Xie, Odette Scharenborg

TL;DR
This paper introduces AnyoneNet, a method that generates synchronized speech and talking-head videos for an arbitrary person from text and a single face image, combining face-conditioned TTS with landmark-based head movement prediction.
Contribution
It presents a novel face-conditioned multi-speaker TTS model and a landmark-based head movement prediction method for arbitrary person video synthesis.
Findings
Generates synchronized speech and talking-head videos for any person, including people not seen during training
Synthesized speech matches face appearance and voice timbre
Outperforms state-of-the-art landmark-based methods
Abstract
Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input. In contrast to previous text-driven talking head generation methods, which can only synthesize the voice of a specific person, the proposed method is capable of synthesizing speech for any person that is inaccessible in the training stage. Specifically, the proposed method decomposes the generation of synchronized speech and talking head videos into two stages, i.e., a text-to-speech (TTS) stage and a speech-driven talking head generation stage. The proposed TTS module is a face-conditioned multi-speaker TTS model that gets the…
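The two-stage decomposition described in the abstract can be pictured as a simple data flow: text and a face image go into a TTS stage, and the resulting speech plus the same face image go into a talking-head stage. The sketch below only illustrates that flow. Every function name, the `TalkingHeadOutput` type, and the toy arithmetic inside the stages are hypothetical placeholders and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TalkingHeadOutput:
    speech: List[float]        # placeholder for synthesized audio samples
    frames: List[List[float]]  # placeholder for one vector per video frame


def face_conditioned_tts(text: str, face_image: Sequence[float]) -> List[float]:
    """Stage 1 (hypothetical stand-in): text + face image -> speech.

    A real model would condition a multi-speaker TTS system on the face;
    here the face only shifts a toy per-character signal.
    """
    voice_bias = sum(face_image) / len(face_image)
    return [voice_bias + 0.01 * ord(ch) for ch in text]


def speech_driven_talking_head(
    speech: Sequence[float],
    face_image: Sequence[float],
    samples_per_frame: int = 4,  # assumed value, for illustration only
) -> List[List[float]]:
    """Stage 2 (hypothetical stand-in): speech + face image -> video frames.

    Each frame is derived from one chunk of speech, which is what keeps
    the frames aligned with the audio in this toy version.
    """
    frames = []
    for start in range(0, len(speech), samples_per_frame):
        chunk = speech[start:start + samples_per_frame]
        energy = sum(chunk) / len(chunk)
        frames.append([value + energy for value in face_image])
    return frames


def generate(text: str, face_image: Sequence[float]) -> TalkingHeadOutput:
    """Chain the two stages: the TTS output drives the talking-head stage."""
    speech = face_conditioned_tts(text, face_image)
    frames = speech_driven_talking_head(speech, face_image)
    return TalkingHeadOutput(speech=speech, frames=frames)
```

The point of the sketch is the interface: the second stage consumes the output of the first, so the same single face image conditions both the voice and the video.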
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
