Contextual Audio-Visual Switching For Speech Enhancement in Real-World Environments
Ahsan Adeel, Mandar Gogate, Amir Hussain

TL;DR
This paper introduces a context-aware audio-visual switching system for speech enhancement that adaptively combines visual and audio cues based on noise levels, improving speech quality in real-world noisy environments.
Contribution
It proposes a novel AV switching component that dynamically uses visual cues, audio cues, or both, without requiring SNR estimation, improving speech enhancement performance across varying noise conditions.
Findings
Outperforms audio-only and visual-only methods at different SNRs.
Effectively handles spectro-temporal variations in real-world noise.
Demonstrates superior perceptual and subjective speech quality improvements.
Abstract
Human speech processing is inherently multimodal: visual cues (lip movements) help listeners understand speech in noise. Lip-reading-driven speech enhancement significantly outperforms benchmark audio-only approaches at low signal-to-noise ratios (SNRs). However, at high SNRs, that is, at low levels of background noise, visual cues become less effective for speech enhancement. A context-aware audio-visual (AV) system is therefore required, one that utilises both visual and noisy audio features and accounts for different noise conditions. In this paper, we introduce a novel contextual AV switching component that exploits AV cues according to the operating conditions to estimate clean audio, without requiring any SNR estimation. The switching module switches between visual-only (V-only), audio-only (A-only), and both AV cues…
