Robust Audiovisual Speech Recognition Models with Mixture-of-Experts

Yihan Wu; Yifan Peng; Yichen Lu; Xuankai Chang; Ruihua Song; Shinji Watanabe

arXiv:2409.12370 · eess.AS · September 20, 2024

TL;DR

This paper presents EVA, a mixture-of-experts audiovisual speech recognition model that enhances robustness and generalization across diverse video scenarios, achieving state-of-the-art results on multiple benchmarks.

Contribution

Introduces EVA, a novel mixture-of-experts based audiovisual speech recognition model that effectively incorporates visual signals for robust 'in-the-wild' speech recognition.

Findings

01. Achieves state-of-the-art results on three benchmarks.

02. Demonstrates strong generalization across diverse video domains.

03. Effectively incorporates visual information into speech recognition.

Abstract

Visual signals can enhance audiovisual speech recognition accuracy by providing additional contextual information. Given the complexity of visual signals, an audiovisual speech recognition model requires robust generalization capabilities across diverse video scenarios, presenting a significant challenge. In this paper, we introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for "in-the-wild" videos. Specifically, we first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection. Then, we build EVA upon a robust pretrained speech recognition model, ensuring its generalization ability. Moreover, to incorporate visual information effectively, we inject visual information into the ASR model through a mixture-of-experts module. Experiments show our model achieves state-of-the-art…
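The pipeline outlined in the abstract (project visual tokens into the speech space, then mix them in through a mixture-of-experts module) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the dimensions, the random weights, the top-2 softmax router, the single-matrix ReLU experts, the residual connection, and the choice to concatenate projected visual tokens in front of the speech frames. The names `MoELayer`, `projection`, and `fused` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class MoELayer:
    """Toy mixture-of-experts layer with top-k routing (assumed design)."""

    def __init__(self, dim, num_experts, top_k, rng):
        self.top_k = top_k
        self.router = rand_matrix(num_experts, dim, rng)
        self.experts = [rand_matrix(dim, dim, rng) for _ in range(num_experts)]

    def gate(self, x):
        # Score every expert, keep the top-k, renormalize their weights.
        probs = softmax(matvec(self.router, x))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = top[: self.top_k]
        norm = sum(probs[i] for i in top)
        return {i: probs[i] / norm for i in top}

    def __call__(self, x):
        out = [0.0] * len(x)
        for i, weight in self.gate(x).items():
            # Each expert is a single linear map followed by ReLU.
            y = [max(0.0, v) for v in matvec(self.experts[i], x)]
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out


rng = random.Random(0)
VIS_DIM, SPEECH_DIM = 6, 4  # illustrative sizes only

# Lightweight projection from visual-token space to speech space.
projection = rand_matrix(SPEECH_DIM, VIS_DIM, rng)
moe = MoELayer(SPEECH_DIM, num_experts=4, top_k=2, rng=rng)

visual_tokens = [[rng.uniform(-1, 1) for _ in range(VIS_DIM)] for _ in range(3)]
speech_frames = [[rng.uniform(-1, 1) for _ in range(SPEECH_DIM)] for _ in range(5)]

# Map visual tokens into the speech space and place them before the speech frames.
projected = [matvec(projection, v) for v in visual_tokens]
sequence = projected + speech_frames

# Residual MoE update applied to every token in the joint sequence.
fused = [[x + m for x, m in zip(tok, moe(tok))] for tok in sequence]
```

In this sketch the router picks a different pair of experts per token, so visual and speech tokens can be handled by different parameters while sharing one sequence; whether EVA routes this way is not stated in the abstract.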

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing