Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

Yuchen Hu; Ruizhe Li; Chen Chen; Chengwei Qin; Qiushi Zhu; Eng Siong Chng

arXiv:2306.10563 · eess.AS · June 21, 2023 · 1 citation

TL;DR

This paper introduces a universal viseme-phoneme mapping method that enhances audio-visual speech recognition robustness in noisy environments by enabling modality transfer without relying on noisy training data.

Contribution

The paper proposes a novel universal viseme-phoneme mapping approach for modality transfer, improving noise robustness in AVSR without dependence on noisy training data.

Findings

01. Achieves state-of-the-art results on the LRS3 and LRS2 benchmarks.
02. Outperforms previous methods in both noisy and clean conditions.
03. Shows that the noise-invariant visual modality can strengthen the robustness of AVSR.

Abstract

Audio-visual speech recognition (AVSR) offers a promising way to improve the noise robustness of audio-only speech recognition by adding visual information. However, most existing work still focuses on the audio modality to improve robustness, given its dominance in the AVSR task, using noise adaptation techniques such as front-end denoising. Though effective, these methods usually face two practical challenges: 1) a lack of sufficient labeled noisy audio-visual training data in some real-world scenarios and 2) suboptimal generalization to unseen test noises. In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any test noise without depending on noisy training data, i.e., unsupervised noise adaptation. Inspired by the human perception mechanism, we propose a universal viseme-phoneme…

Code & Models

Repositories

yuchen005/univpm
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing
