MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Sungnyun Kim; Kangwook Jang; Sangmin Bae; Sungwoo Cho; Se-Young Yun

arXiv:2502.10447·eess.AS·May 22, 2025

Open Access

TL;DR

MoHAVE is a scalable audio-visual speech recognition framework built on a mixture-of-experts architecture with hierarchical gating. It improves robustness to noise and reports state-of-the-art results on the LRS3 and MuAViC benchmarks.

Contribution

The paper presents MoHAVE, a scalable audio-visual speech recognition (AVSR) system that uses a mixture-of-experts architecture with hierarchical gating to adapt dynamically to each modality.

Findings

01 Achieves state-of-the-art results on LRS3 and MuAViC benchmarks.

02 Demonstrates efficient scalability with minimal computational overhead.

03 Enhances robustness and adaptability in noisy environments.

Abstract

Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3)…
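The hierarchical gating idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes two expert groups (one audio-oriented, one visual-oriented), a softmax group router, a per-group router that keeps the top-k experts, and linear experts. All names (`hierarchical_moe`, `group_router`, `expert_routers`) and all sizes are hypothetical, chosen only to show how a two-level gate keeps most experts inactive for a given token.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def top_k(probs, k):
    """Return (index, renormalized weight) pairs for the k largest entries."""
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in idx)
    return [(i, probs[i] / total) for i in idx]


def hierarchical_moe(x, group_router, expert_routers, experts, k=1):
    """Route one token through a two-level gate (assumed design, see lead-in).

    Level 1: a group router weights the expert groups (e.g. audio vs. visual).
    Level 2: each group's own router keeps only its top-k experts.
    """
    group_weights = softmax(matvec(group_router, x))
    out = [0.0] * len(x)
    active = []  # (group, expert) pairs that were actually evaluated
    for g, g_w in enumerate(group_weights):
        expert_probs = softmax(matvec(expert_routers[g], x))
        for e, e_w in top_k(expert_probs, k):
            y = matvec(experts[g][e], x)  # toy linear expert
            out = [o + g_w * e_w * yi for o, yi in zip(out, y)]
            active.append((g, e))
    return out, group_weights, active


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 2 groups, 4 experts per group, 4-dimensional tokens.
rng = random.Random(0)
d, n_groups, n_experts = 4, 2, 4
group_router = random_matrix(n_groups, d, rng)
expert_routers = [random_matrix(n_experts, d, rng) for _ in range(n_groups)]
experts = [[random_matrix(d, d, rng) for _ in range(n_experts)]
           for _ in range(n_groups)]

token = [0.5, -0.2, 0.1, 0.9]
output, group_weights, active = hierarchical_moe(
    token, group_router, expert_routers, experts, k=1
)
```

With `k=1`, only 2 of the 8 toy experts run per token, which is the sense in which a sparse MoE can add capacity without a proportional increase in compute. The group weights play the role of the input-dependent modality preference described in the abstract; how MoHAVE actually computes and trains them is specified in the paper, not here.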

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Mixture of Experts