Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong, Chng

TL;DR
This paper introduces a universal viseme-phoneme mapping method that enhances audio-visual speech recognition robustness in noisy environments by enabling modality transfer without relying on noisy training data.
Contribution
The paper proposes a novel universal viseme-phoneme mapping approach for modality transfer, improving noise robustness in AVSR without dependence on noisy training data.
Findings
Achieves state-of-the-art results on LRS3 and LRS2 benchmarks.
Outperforms previous methods in noisy and clean conditions.
Demonstrates effective noise-invariant visual modality for AVSR.
Abstract
Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on audio modality to improve robustness considering its dominance in AVSR task, with noise adaptation techniques such as front-end denoise processing. Though effective, these methods are usually faced with two practical challenges: 1) lack of sufficient labeled noisy audio-visual training data in some real-world scenarios and 2) less optimal model generality to unseen testing noises. In this work, we investigate the noise-invariant visual modality to strengthen robustness of AVSR, which can adapt to any testing noises while without dependence on noisy training data, a.k.a., unsupervised noise adaptation. Inspired by human perception mechanism, we propose a universal viseme-phoneme…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing
MethodsFocus
