Lip-to-Speech Synthesis in the Wild with Multi-task Learning
Minsu Kim, Joanna Hong, Yong Man Ro

TL;DR
This paper introduces a multi-task learning approach for lip-to-speech synthesis that uses multimodal supervision (text and audio) to reconstruct speech with the correct content from lip movements in wild environments.
Contribution
It proposes a Lip2Speech framework trained with multimodal supervision (text and audio) to improve the content accuracy of synthesized speech in unconstrained settings, where prior methods lacked sufficient supervision to infer the correct content.
Findings
Reconstructs speech effectively in wild environments.
Synthesizes speech with the correct content for multiple speakers and unconstrained sentences.
Validated on LRS2, LRS3, and LRW datasets.
Abstract
Recent studies have shown impressive performance in lip-to-speech synthesis (Lip2Speech), which aims to reconstruct speech from visual information alone. However, they struggle to synthesize accurate speech in the wild because the supervision is insufficient to guide the model toward the correct content. In contrast to previous methods, we develop a powerful Lip2Speech method that can reconstruct speech with the correct content from the input lip movements, even in a wild environment. To this end, we design multi-task learning that guides the model with multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss. The proposed framework can therefore synthesize speech with the right content for multiple speakers and unconstrained sentences. We verify the effectiveness of…
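The multi-task idea in the abstract can be illustrated as a weighted sum of an acoustic reconstruction loss and two auxiliary supervision terms. The sketch below is an illustration under stated assumptions, not the authors' implementation: the per-token cross-entropy standing in for text supervision, the L1 feature-matching term standing in for audio supervision, the function names, and the weights w_text and w_audio are all hypothetical choices that the abstract does not specify.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two equally shaped 2-D lists (frames x bins)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += abs(p - t)
            count += 1
    return total / count


def cross_entropy(logits, label):
    """Negative log-probability of `label` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def multitask_loss(pred_mel, true_mel, text_logits, text_labels,
                   pred_feat, true_feat, w_text=1.0, w_audio=1.0):
    """Combine reconstruction, text, and audio supervision into one loss.

    All three term definitions and both weights are assumptions made
    for illustration; the paper may use different objectives.
    """
    recon = l1_loss(pred_mel, true_mel)          # acoustic feature reconstruction
    text = sum(cross_entropy(l, y)               # assumed text supervision
               for l, y in zip(text_logits, text_labels)) / len(text_labels)
    audio = l1_loss(pred_feat, true_feat)        # assumed audio supervision
    total = recon + w_text * text + w_audio * audio
    return total, {"recon": recon, "text": text, "audio": audio}
```

In this sketch the reconstruction term alone only measures how close the predicted acoustic features are to the target, while the two auxiliary terms add signals about which words were spoken; the weights control how strongly each auxiliary task influences training.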
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis
