Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

TL;DR
This paper introduces a multilingual audio-visual speech recognition model using a hybrid CTC/RNN-T Fast Conformer architecture, achieving state-of-the-art results and robustness in noisy conditions across multiple languages.
Contribution
It adapts the Fast Conformer model for multimodal speech recognition with a novel hybrid CTC/RNN-T approach and expands training data with automatically transcribed multilingual datasets.
Findings
Achieves 0.8% WER on the LRS3 dataset.
Reduces average WER by 11.9% absolute on the MuAViC benchmark.
Performs well in audio-only, visual-only, and combined modes.
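The findings above are reported as word error rate (WER), and the abstract describes the MuAViC improvement as an absolute reduction. A minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words), and of the difference between an absolute and a relative reduction, is given below. The function names are hypothetical and the numbers in the usage note are illustrative only; this is not the paper's scoring pipeline, which may also normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs (e.g. percentage points)."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration with made-up numbers, moving from a baseline WER of 50.0% to 40.0% is an absolute reduction of 10.0 percentage points but a relative reduction of 20%, which is why the distinction matters when reading benchmark summaries.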
Abstract
Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets (VoxCeleb2 and AVSpeech). Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%. On the recently introduced MuAViC benchmark, our model yields an absolute…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
