mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

Andrew Rouditchenko; Samuel Thomas; Hilde Kuehne; Rogerio Feris; James Glass

arXiv:2502.01547·eess.AS·May 8, 2025

Open Access · 1 Repo

TL;DR

This paper introduces mWhisper-Flamingo, a multilingual audio-visual speech recognition model that combines pre-trained audio and video models, improving noise robustness and achieving state-of-the-art results across nine languages.

Contribution

It presents a novel multimodal integration method with decoder modality dropout, enhancing multilingual AVSR performance in noisy environments.

Findings

01. Achieves state-of-the-art WER on the MuAViC dataset.

02. Outperforms audio-only Whisper in noisy conditions across all tested languages.

03. Demonstrates effective multilingual audio-visual speech recognition.

Abstract

Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
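
The decoder modality dropout described in the abstract (training on paired audio-visual inputs as well as on audio-only and video-only inputs) can be pictured as a per-example sampling step. The following is a minimal sketch, not the authors' implementation: the function names, the mode probabilities, and the choice to represent a dropped modality as `None` are all assumptions made for illustration, and the abstract does not state the actual values or mechanism used.

```python
import random
from typing import Any, Dict, Optional, Tuple

# Hypothetical probabilities for each training mode. The abstract does not
# give the values used in the paper; these are placeholders for illustration.
MODE_PROBS: Dict[str, float] = {
    "audio_visual": 0.5,
    "audio_only": 0.25,
    "video_only": 0.25,
}


def sample_mode(rng: random.Random, probs: Dict[str, float] = MODE_PROBS) -> str:
    """Pick which modalities the decoder sees for one training example."""
    modes = list(probs)
    weights = [probs[m] for m in modes]
    return rng.choices(modes, weights=weights, k=1)[0]


def apply_modality_dropout(
    audio_feats: Any, video_feats: Any, mode: str
) -> Tuple[Optional[Any], Optional[Any]]:
    """Return the (audio, video) features passed on to the decoder.

    A dropped modality is returned as None here (an assumption); a real
    model might instead zero the features or skip a cross-attention layer.
    """
    if mode == "audio_visual":
        return audio_feats, video_feats
    if mode == "audio_only":
        return audio_feats, None
    if mode == "video_only":
        return None, video_feats
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for encoder outputs (e.g. from Whisper and AV-HuBERT).
    audio, video = "audio_features", "video_features"
    for _ in range(5):
        mode = sample_mode(rng)
        print(mode, apply_modality_dropout(audio, video, mode))
```

Sampling the mode per example, rather than fixing it per run, is what exposes the decoder to all three input conditions during a single training pass; whether the paper samples per example or per batch is not stated in the abstract.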

Code & Models

Repositories

roudimit/whisper-flamingo (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Dropout