Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Xiuzhe Wu; Pengfei Hu; Yang Wu; Xiaoyang Lyu; Yan-Pei Cao; Ying Shan; Wenming Yang; Zhongqian Sun; Xiaojuan Qi

arXiv:2309.04814 · cs.CV · September 12, 2023

TL;DR

Speech2Lip is a framework that disentangles speech-sensitive from speech-insensitive motion and appearance, so that high-fidelity talking videos can be generated from a short training video. It reports better results than existing methods on three benchmark datasets.

Contribution

The paper proposes a decomposition-synthesis-composition framework that pairs a speech-driven implicit model for lip image generation with a geometry-aware mapping, in order to synthesize natural lip and head motion.

Findings

01. Achieves high visual quality and synchronization with limited training data.

02. Outperforms existing methods on three benchmark datasets.

03. Can generate realistic talking videos with arbitrary head poses.

Abstract

Synthesizing realistic videos according to a given speech is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that only motions and appearances on limited facial areas (e.g., lip area) are mainly driven by the input speech. Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., canonical space), we present a speech-driven implicit model for lip image generation…

Code & Models

Repositories

cvmi-lab/speech2lip (PyTorch, official)

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis