Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Maxime Burchi; Krishna C. Puvvada; Jagadeesh Balam; Boris Ginsburg; Radu Timofte

arXiv:2405.12983·eess.AS·May 24, 2024

TL;DR

This paper introduces a multilingual audio-visual speech recognition model using a hybrid CTC/RNN-T Fast Conformer architecture, achieving state-of-the-art results and robustness in noisy conditions across multiple languages.

Contribution

It adapts the Fast Conformer model for multimodal speech recognition with a novel hybrid CTC/RNN-T approach and expands training data with automatically transcribed multilingual datasets.

Findings

01

Achieves a word error rate (WER) of 0.8% on the LRS3 dataset.

02

Reduces average WER by an absolute 11.9% on the MuAViC benchmark.

03

Performs well in audio-only, visual-only, and combined modes.
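
The findings above are reported as word error rate (WER), and the MuAViC figure is an absolute reduction, so it helps to see how both are computed. The following is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, the tokenization is a plain whitespace split, and the numbers in the usage note are illustrative rather than taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper: uses whitespace tokenization and no text
    normalization, which real benchmarks usually apply first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer
```

With illustrative values, moving from a baseline WER of 0.40 to 0.30 is an absolute reduction of 0.10 (10 percentage points) but a relative reduction of 0.25 (25%). The distinction matters when comparing the 11.9% figure against results reported elsewhere.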

Abstract

Humans are adept at leveraging visual cues from lip movements to recognize speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets (VoxCeleb2 and AVSpeech). Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%. On the recently introduced MuAViC benchmark, our model yields an absolute…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques