Lip-to-Speech Synthesis in the Wild with Multi-task Learning

Minsu Kim; Joanna Hong; Yong Man Ro

arXiv:2302.08841 · cs.SD · February 20, 2023

TL;DR

This paper introduces a multi-task learning approach for lip-to-speech synthesis that effectively reconstructs accurate speech content from lip movements in wild environments by leveraging multimodal supervision.

Contribution

It proposes a novel Lip2Speech framework that uses multimodal supervision (text and audio) to improve the accuracy of speech content in unconstrained settings, where prior methods lacked sufficient supervision to infer the correct content.

Findings

1. Effective speech reconstruction in wild environments.
2. Able to synthesize speech with correct content from multiple speakers.
3. Validated on LRS2, LRS3, and LRW datasets.

Abstract

Recent studies have shown impressive performance in lip-to-speech synthesis, which aims to reconstruct speech from visual information alone. However, they struggle to synthesize accurate speech in the wild because the supervision is insufficient to guide the model toward the correct content. In contrast to previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with correct content from the input lip movements, even in a wild environment. To this end, we design multi-task learning that guides the model with multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss. The proposed framework therefore has the advantage of synthesizing speech with the right content for multiple speakers and unconstrained sentences. We verify the effectiveness of…
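The multi-task idea in the abstract can be illustrated with a minimal sketch: an acoustic reconstruction loss is combined with an auxiliary text loss through a weighted sum. The abstract does not give the loss forms or weights, so everything here is an assumption for illustration only: the L1 reconstruction term, the per-frame cross-entropy used as a stand-in for the text supervision, the function names, and the `text_weight` parameter are all hypothetical. The audio-side supervision term is omitted.

```python
import math


def l1_reconstruction_loss(pred_mel, target_mel):
    """Mean absolute error between predicted and target acoustic frames."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred_mel, target_mel):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


def text_cross_entropy(logits, token_ids):
    """Mean negative log-likelihood of the target tokens under softmax(logits).

    Assumption: one target token per frame. This is a simplified stand-in
    for whatever text objective the paper actually uses.
    """
    total = 0.0
    for frame_logits, token in zip(logits, token_ids):
        m = max(frame_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        total += log_z - frame_logits[token]
    return total / len(token_ids)


def multitask_loss(pred_mel, target_mel, logits, token_ids, text_weight=0.5):
    """Weighted sum of the reconstruction loss and the auxiliary text loss.

    text_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    recon = l1_reconstruction_loss(pred_mel, target_mel)
    text = text_cross_entropy(logits, token_ids)
    return recon + text_weight * text, recon, text


if __name__ == "__main__":
    # Toy inputs: 2 frames, 2 acoustic bins, 2 text classes.
    total, recon, text = multitask_loss(
        pred_mel=[[0.0, 1.0], [2.0, 3.0]],
        target_mel=[[0.0, 2.0], [2.0, 1.0]],
        logits=[[0.0, 0.0], [0.0, 0.0]],
        token_ids=[0, 1],
    )
    print(f"total={total:.4f} recon={recon:.4f} text={text:.4f}")
```

The point of the weighted sum is that the text term penalizes wrong content directly, which a reconstruction loss on acoustic features alone may not do strongly enough.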

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis