LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya

TL;DR
LipVoicer is a novel lip-to-speech method that leverages lip reading and text guidance via a diffusion model to generate highly intelligible, natural, and synchronized speech from silent videos, outperforming existing approaches.
Contribution
The paper introduces LipVoicer, which integrates lip reading and the text modality with a diffusion model to improve speech synthesis from silent videos, especially on challenging datasets.
Findings
Outperforms baselines on LRS2 and LRS3 datasets.
Significantly improves speech intelligibility and naturalness.
Reduces Word Error Rate (WER) in generated speech.
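For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration, not the evaluation code used in the paper; the function name `word_error_rate` and the lowercase, whitespace-split tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER = {wer:.3f}")
```

In a lip-to-speech setting, the hypothesis would typically be the output of an ASR system run on the generated audio, so a lower WER indicates more intelligible speech.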
Abstract
Lip-to-speech involves generating natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality speech with high levels of intelligibility for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism where a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary.…
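The classifier-guidance step mentioned in the abstract can be illustrated with the generic noise-prediction form of classifier guidance, in which the gradient of the classifier's log-likelihood shifts the diffusion model's noise estimate. The sketch below is a toy under stated assumptions, not LipVoicer's implementation: `guided_noise`, `toy_grad_log_p`, and `guidance_scale` are hypothetical names, and a simple quadratic log-likelihood stands in for the pre-trained ASR.

```python
import numpy as np


def guided_noise(eps_cond, grad_log_p_text, alpha_bar_t, guidance_scale):
    """Generic classifier guidance applied to a noise prediction.

    eps_cond        : noise predicted by the video-conditioned diffusion model
    grad_log_p_text : gradient of log p(text | x_t) with respect to x_t
                      (per the abstract, an ASR plays the classifier role)
    alpha_bar_t     : cumulative noise-schedule product at step t, in (0, 1]
    guidance_scale  : hypothetical strength knob; 0 disables guidance
    """
    return eps_cond - guidance_scale * np.sqrt(1.0 - alpha_bar_t) * grad_log_p_text


def toy_grad_log_p(x_t, target):
    """Stand-in for the ASR gradient: log p = -0.5 * ||x_t - target||^2."""
    return target - x_t


# Toy values only: a 3-dimensional "spectrogram" and a zero noise estimate.
x_t = np.zeros(3)
target = np.ones(3)
eps_cond = np.zeros(3)

eps_guided = guided_noise(
    eps_cond,
    toy_grad_log_p(x_t, target),
    alpha_bar_t=0.75,
    guidance_scale=2.0,
)
print(eps_guided)
```

In a real system the gradient would come from backpropagating the ASR's log-likelihood of the lip-read text through the noisy mel-spectrogram; the toy gradient here only shows how the guidance term enters the update.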
Peer Reviews
Decision: ICLR 2024 poster
The general structure is clear. The method is simple in general. It’s easy to follow. The performance is good, with a large margin over other methods. It’s also a nice try to include the predicted text into the learning process.
(1) I am a little confused with Fig. 1a. The output of the lipreading module is the predicted text. The output of the ASR modules is also the predicted text. There should be no connections from the output of the predicted text to the ASR module? The ASR module should take the output of MelGen as input? without the text predicted from LR module?
(2) Lip2speech (Kim et al. (2023)) takes the ground-truth text as input to constrain the learning process and has shown the success of the role of text mod
1. LipVoicer greatly improves the intelligibility of the generated speech and outperforms existing lip-to-speech baselines on challenging datasets, demonstrating its superior performance.
2. The paper provides detailed implementation details, making it easier for others to reproduce and further improve upon the LipVoicer method.
3. By introducing a pre-trained ASR model, this paper realizes a good application of classifier-guidance diffusion model in lip2speech task.
1. After listening to the demo page, it is found that the gap between different models is mainly in sound quality. The baselines are too weak in sound quality. However, the problem of sound quality can be solved by many existing generative models based on VAE/GAN/flow models. If the sound quality problem of the baselines is solved, the advantage of the model proposed in this paper may not be so great.
2. In previous studies, a very important motivation for lip2speech tasks was to dispense with text moda
- The key ideas are reasonable, and the method is a well-engineered combination of proven methods.
- The use of a pre-trained state-of-the-art lip reading model significantly lowers the WER compared to existing methods.
- The diffusion model generates natural-sounding output, according to the qualitative results reported.
- It is not clear if the performance improvement comes from the key improvements, or the replacement of the vocoder, which can be seen as a post-processing step rather than a key part of the algorithm. It is well known that DiffWave produces much more natural-sounding output compared to the Griffin-Lim algorithm used by the previous works.
- The authors request subjective assessors to rate Intelligibility, Naturalness, Quality and Synchronisation, but it is not clear what the difference between
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis
Methods: classifier-guidance · Diffusion · Test
