Large-vocabulary Audio-visual Speech Recognition in Noisy Environments

Wentao Yu; Steffen Zeiler; Dorothea Kolossa

arXiv:2109.04894 · eess.AS · September 13, 2021

TL;DR

This paper proposes a recurrent fusion strategy for large-vocabulary audio-visual speech recognition: a recurrent integration network fuses the state posteriors of single-modality models, guided by stream reliability measures, and improves recognition accuracy in noisy environments.

Contribution

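A recurrent integration network that fuses the state posteriors of multiple single-modality models, guided by model-based and signal-based stream reliability measures. It handles time-variant information in AVSR and outperforms the fusion baselines it is compared against, including oracle dynamic stream weighting.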

Findings

01. Achieves 42.18% relative WER reduction over audio-only models.

02. Outperforms oracle dynamic stream weighting in fusion.

03. Effective under clean, noisy, and reverberant conditions.
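
The first finding is stated as a relative word error rate (WER) reduction. As a reminder of how such a figure is conventionally computed, here is a minimal sketch. The function name and the WER values in the usage line are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WERs (in percent), for illustration only; not taken from the paper.
example = relative_wer_reduction(baseline_wer=10.0, system_wer=6.0)
print(f"{example:.2f}% relative reduction")
```

A relative reduction is measured against the baseline's own error rate, so the same absolute gain counts for more when the baseline is already strong.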

Abstract

Audio-visual speech recognition (AVSR) can effectively and significantly improve the recognition rates of small-vocabulary systems, compared to their audio-only counterparts. For large-vocabulary systems, however, there are still many difficulties, such as unsatisfactory video recognition accuracies, that make it hard to improve over audio-only baselines. In this paper, we specifically consider such scenarios, focusing on the large-vocabulary task of the LRS2 database, where audio-only performance is far superior to video-only accuracies, making this an interesting and challenging setup for multi-modal integration. To address the inherent difficulties, we propose a new fusion strategy: a recurrent integration network is trained to fuse the state posteriors of multiple single-modality models, guided by a set of model-based and signal-based stream reliability measures. During decoding,…
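
The abstract describes fusing the state posteriors of single-modality models, and the findings compare the proposed network against dynamic stream weighting. The sketch below illustrates only that baseline idea in its common log-linear form, where a per-frame weight trades off the audio and video streams. It is not the paper's recurrent integration network, and the function name, array shapes, and weight values are assumptions made for this example.

```python
import numpy as np


def stream_weighted_fusion(log_post_audio, log_post_video, weights):
    """Log-linear fusion of two streams of per-frame state log-posteriors.

    log_post_audio, log_post_video: arrays of shape (T, S), T frames, S states.
    weights: array of shape (T,), audio stream weight in [0, 1] for each frame.
    Returns fused log-posteriors of shape (T, S), renormalised per frame.
    """
    lam = np.asarray(weights, dtype=float)[:, None]
    fused = lam * log_post_audio + (1.0 - lam) * log_post_video
    # Renormalise with a numerically stable log-sum-exp over the state axis.
    peak = fused.max(axis=1, keepdims=True)
    log_norm = peak + np.log(np.exp(fused - peak).sum(axis=1, keepdims=True))
    return fused - log_norm


# Toy inputs (random, illustrative only): 4 frames, 3 states.
rng = np.random.default_rng(0)
T, S = 4, 3
audio = np.log(rng.dirichlet(np.ones(S), size=T))
video = np.log(rng.dirichlet(np.ones(S), size=T))
lam = np.array([1.0, 0.7, 0.3, 0.0])  # hypothetical per-frame audio weights
fused = stream_weighted_fusion(audio, video, lam)
```

With a weight of 1 the fused frame reduces to the audio posterior, and with a weight of 0 to the video posterior. Intermediate weights interpolate in the log domain. In this hand-set form the weights are fixed inputs, whereas the abstract describes a trained network guided by reliability measures.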

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing