Large-vocabulary Audio-visual Speech Recognition in Noisy Environments
Wentao Yu, Steffen Zeiler, Dorothea Kolossa

TL;DR
This paper introduces a novel recurrent fusion strategy for large-vocabulary audio-visual speech recognition that significantly improves recognition accuracy in noisy environments by effectively integrating multi-modal information.
Contribution
A recurrent integration network that fuses the state posteriors of multiple single-modality models, guided by model-based and signal-based stream reliability measures. It outperforms oracle dynamic stream weighting and handles time-variant information in AVSR.
Findings
Achieves a 42.18% relative word error rate (WER) reduction over audio-only models.
Outperforms oracle dynamic stream weighting in fusion.
Effective under clean, noisy, and reverberant conditions.
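For context on the baseline named above: dynamic stream weighting is commonly formulated as a frame-wise weighted combination of the per-stream state log-posteriors, with weights that may change over time. The sketch below illustrates that general idea for two streams. It is not the paper's implementation; the function name `stream_weighted_log_posteriors`, the restriction to two streams, and all toy numbers are assumptions made for illustration.

```python
import math

def stream_weighted_log_posteriors(log_p_audio, log_p_video, audio_weights):
    """Fuse per-frame state log-posteriors of an audio and a video stream.

    log_p_audio, log_p_video: list of frames, each a list of log-probabilities
        (one entry per state).
    audio_weights: one weight in [0, 1] per frame; the video stream
        receives 1 - weight (illustrative convention).
    Returns renormalised fused log-posteriors, one list per frame.
    """
    fused = []
    for la, lv, w in zip(log_p_audio, log_p_video, audio_weights):
        # Weighted sum in the log domain (log-linear combination).
        scores = [w * a + (1.0 - w) * v for a, v in zip(la, lv)]
        # Renormalise with a numerically stable log-sum-exp.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        fused.append([s - log_z for s in scores])
    return fused

# Toy posteriors for 2 frames and 3 states (made-up values).
log_audio = [[math.log(p) for p in frame]
             for frame in [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]]
log_video = [[math.log(p) for p in frame]
             for frame in [[0.3, 0.4, 0.3], [0.1, 0.8, 0.1]]]
weights = [1.0, 0.5]  # frame 0 trusts audio fully, frame 1 weighs both equally

fused_sw = stream_weighted_log_posteriors(log_audio, log_video, weights)
```

An "oracle" variant would choose the per-frame weights with access to the ground-truth state sequence, which makes it an upper bound for this family of methods rather than a deployable system.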
Abstract
Audio-visual speech recognition (AVSR) can effectively and significantly improve the recognition rates of small-vocabulary systems, compared to their audio-only counterparts. For large-vocabulary systems, however, there are still many difficulties, such as unsatisfactory video recognition accuracies, that make it hard to improve over audio-only baselines. In this paper, we specifically consider such scenarios, focusing on the large-vocabulary task of the LRS2 database, where audio-only performance is far superior to video-only accuracies, making this an interesting and challenging setup for multi-modal integration. To address the inherent difficulties, we propose a new fusion strategy: a recurrent integration network is trained to fuse the state posteriors of multiple single-modality models, guided by a set of model-based and signal-based stream reliability measures. During decoding,…
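The data flow of the proposed fusion can be sketched roughly as follows: at every frame, the state posteriors of all single-modality models are concatenated with the reliability measures and passed through a recurrent network that emits one fused posterior vector. This is a minimal sketch under strong assumptions, not the authors' architecture: the class name `RecurrentFusionSketch`, the single tanh recurrent layer, the random untrained weights, and the toy inputs are all hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

class RecurrentFusionSketch:
    """Hypothetical single-layer recurrent fusion of stream posteriors."""

    def __init__(self, n_states, n_streams, n_reliability, n_hidden, seed=0):
        rng = random.Random(seed)
        n_in = n_states * n_streams + n_reliability
        # Untrained random weights; a real system would learn these.
        self.w_in = random_matrix(n_hidden, n_in, rng)
        self.w_rec = random_matrix(n_hidden, n_hidden, rng)
        self.w_out = random_matrix(n_states, n_hidden, rng)
        self.n_hidden = n_hidden

    def fuse(self, stream_posteriors, reliabilities):
        """stream_posteriors: one list of frames per stream.
        reliabilities: one list of reliability features per frame."""
        hidden = [0.0] * self.n_hidden
        fused = []
        for t in range(len(reliabilities)):
            # Concatenate all stream posteriors and reliability features.
            x = []
            for stream in stream_posteriors:
                x.extend(stream[t])
            x.extend(reliabilities[t])
            # Recurrent update: the hidden state carries temporal context.
            pre = [a + b for a, b in zip(matvec(self.w_in, x),
                                         matvec(self.w_rec, hidden))]
            hidden = [math.tanh(p) for p in pre]
            # Fused state posterior for this frame.
            fused.append(softmax(matvec(self.w_out, hidden)))
        return fused

# Toy inputs: 2 frames, 3 states, 2 streams, 2 reliability features.
audio = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
video = [[0.4, 0.4, 0.2], [0.2, 0.5, 0.3]]
reliability = [[0.9, 0.3], [0.5, 0.4]]

model = RecurrentFusionSketch(n_states=3, n_streams=2,
                              n_reliability=2, n_hidden=4)
fused_rnn = model.fuse([audio, video], reliability)
```

Unlike a fixed or per-frame scalar stream weight, a recurrent model of this kind can in principle condition the fused output on how the posteriors and reliability measures evolved over preceding frames.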
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
