Lip-Reading Driven Deep Learning Approach for Speech Enhancement
Ahsan Adeel, Mandar Gogate, Amir Hussain, William M. Whitmer

TL;DR
This paper introduces a novel audio-visual speech enhancement framework that combines deep learning-based lip-reading with an enhanced Wiener filter, significantly improving speech quality and intelligibility in noisy real-world scenarios.
Contribution
It presents a new deep learning lip-reading model and an enhanced visually-derived Wiener filter, integrating visual and acoustic modeling for superior speech enhancement.
Findings
Significant improvement in speech quality and intelligibility over benchmark methods.
Effective performance across various real-world noisy environments.
Demonstrated robustness at different SNR levels.
Abstract
This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The proposed approach leverages the complementary strengths of both deep learning and analytical acoustic modelling (a filtering-based approach), in contrast to recently published, comparatively simpler benchmark approaches that rely only on deep learning. The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first level, a novel deep learning-based lip-reading regression model is employed. In the second level, the clean-audio features approximated by lip-reading are exploited, using an enhanced, visually-derived Wiener filter (EVWF), to estimate the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed for clean-audio feature estimation using only temporal visual features considering…
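The second level described in the abstract builds on Wiener filtering. As a rough illustration only, the sketch below applies a classical per-bin Wiener gain, not the paper's EVWF, whose details are not given in this summary. It assumes a clean power spectrum estimate is already available for each frequency bin (in the paper this would come from the lip-reading model), and the names `wiener_gain`, `enhance_frame`, and `floor` are hypothetical.

```python
import numpy as np


def wiener_gain(clean_psd_est, noisy_psd, floor=1e-3):
    """Classical Wiener gain per frequency bin (illustrative, not the EVWF).

    clean_psd_est: assumed estimate of the clean speech power spectrum.
    noisy_psd:     power spectrum of the observed noisy speech.
    floor:         hypothetical lower bound on the gain to limit artifacts.
    """
    clean_psd_est = np.asarray(clean_psd_est, dtype=float)
    noisy_psd = np.asarray(noisy_psd, dtype=float)
    # Assumption: noise power is approximated as the non-negative
    # difference between the noisy and the estimated clean power.
    noise_psd = np.maximum(noisy_psd - clean_psd_est, 0.0)
    # Wiener gain: clean power divided by (clean power + noise power).
    gain = clean_psd_est / (clean_psd_est + noise_psd + 1e-12)
    return np.clip(gain, floor, 1.0)


def enhance_frame(noisy_spectrum, clean_psd_est):
    """Scale one complex noisy STFT frame by the Wiener gain."""
    noisy_spectrum = np.asarray(noisy_spectrum)
    noisy_psd = np.abs(noisy_spectrum) ** 2
    return wiener_gain(clean_psd_est, noisy_psd) * noisy_spectrum
```

Bins where the estimated clean power accounts for most of the observed power pass nearly unchanged, while bins dominated by noise are attenuated toward the floor.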
