Visual Speech Enhancement Without A Real Visual Stream
Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C.V. Jawahar

TL;DR
This paper introduces a speech enhancement method that uses a pseudo-lip model to generate visual cues for noise reduction. It works without real video input, and the intelligibility of its enhanced speech is within 3% of methods that use real lips.
Contribution
The paper presents a new paradigm for speech enhancement by synthesizing lip movements from audio, enabling visual noise filtering without real visual streams.
Findings
Pseudo-lip approach achieves speech intelligibility within 3% of real lip methods.
Model performs well across various real-world noise conditions.
Code and models are publicly available for future research.
Abstract
In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But, these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to the case of using real lips.…
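The teacher-student idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "networks", the feature sizes, the noise level, and the names `teacher_lips` and `train_student` are all hypothetical. It also assumes the teacher is given clean audio while the student sees a noisy copy of the same frames, which the abstract does not spell out.

```python
import numpy as np

def teacher_lips(clean_audio, w_teacher):
    """Hypothetical frozen teacher: maps clean audio features to lip features."""
    return clean_audio @ w_teacher

def train_student(clean, noisy, w_teacher, lr=0.05, steps=500):
    """Fit a linear student on noisy audio to match the teacher's lip targets."""
    targets = teacher_lips(clean, w_teacher)  # teacher only supplies targets
    w_student = np.zeros_like(w_teacher)
    for _ in range(steps):
        pred = noisy @ w_student
        # Gradient of 0.5 * squared error, averaged over frames.
        grad = noisy.T @ (pred - targets) / len(noisy)
        w_student -= lr * grad
    return w_student

rng = np.random.default_rng(0)
clean = rng.normal(size=(256, 8))                    # assumed clean audio features
noisy = clean + 0.3 * rng.normal(size=clean.shape)   # same frames with additive noise
w_teacher = rng.normal(size=(8, 4))                  # assumed frozen teacher weights

w_student = train_student(clean, noisy, w_teacher)
targets = teacher_lips(clean, w_teacher)
untrained_error = np.mean((noisy @ np.zeros_like(w_teacher) - targets) ** 2)
student_error = np.mean((noisy @ w_student - targets) ** 2)
print(f"untrained: {untrained_error:.3f}  student: {student_error:.3f}")
```

In this toy setting the student, which only ever sees noisy input, learns to approximate the lip features the teacher derives from clean input. That is the sense in which the generated lips can act as a noise-robust cue for a downstream enhancement model.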
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
