Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

TL;DR
This paper introduces a novel lip-to-speech system that leverages self-supervised speech representations and acoustic variance modeling to generate high-quality, natural, and intelligible speech from silent videos, addressing the one-to-many mapping challenge.
Contribution
The work proposes a new lip-to-speech approach that incorporates self-supervised representations and a flow-based post-net to improve speech quality and diversity from silent lip movements.
Findings
Achieves speech quality close to that of real human utterances.
Outperforms existing methods in naturalness and intelligibility.
Demonstrates effectiveness on two datasets.
Abstract
The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, resulting in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better solve the aforementioned problem, we employ a flow-based post-net which captures and refines the details of the generated speech. We perform extensive experiments on two datasets, and demonstrate that…
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Video Analysis and Summarization
