Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition

Isha Pandey; Ashish Mittal; Vartul Bahuguna; Ganesh Ramakrishnan

arXiv:2601.19451 · cs.CL · January 28, 2026

TL;DR

This paper introduces SMEAR-MoE, a stabilized mixture-of-experts (MoE) projector for LLM-based multilingual speech recognition. It improves accuracy while maintaining efficiency by enabling cross-lingual sharing and expert specialization.

Contribution

The paper proposes SMEAR-MoE, a novel stabilized MoE projector that ensures dense gradient flow to all experts. This prevents expert collapse and improves multilingual ASR accuracy compared with a conventional single-projector baseline.
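Expert collapse in sparse MoE layers is commonly linked to hard routing: an expert that is rarely selected rarely receives gradient, so it does not improve and keeps losing the routing competition. The NumPy sketch below assumes a generic layer output of the form y = Σ_i w_i · f_i(x), where the f_i are the experts. It contrasts hard top-1 routing weights with dense softmax weights and reports which experts would receive a nonzero gradient. The function names are hypothetical and this is an illustration of the general idea, not the paper's code.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def top1_weights(logits: np.ndarray) -> np.ndarray:
    # Hard top-1 routing: a one-hot weight vector on the highest-scoring expert.
    w = np.zeros_like(logits)
    w[np.argmax(logits)] = 1.0
    return w


def experts_with_gradient(weights: np.ndarray) -> np.ndarray:
    # Assumed output form: y = sum_i w_i * f_i(x).
    # dL/d(params of f_i) is scaled by w_i, so expert i gets a learning
    # signal only when its routing weight is nonzero.
    return weights != 0


# Hypothetical router logits for 4 experts.
logits = np.array([2.0, 0.5, -1.0, 0.1])
print(experts_with_gradient(top1_weights(logits)).tolist())  # [True, False, False, False]
print(experts_with_gradient(softmax(logits)).tolist())       # [True, True, True, True]
```

Because softmax probabilities are strictly positive, every expert contributes to the output and is updated on every step, which is the kind of dense gradient flow the contribution statement refers to.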

Findings

01. Achieves up to a 7.6% relative WER reduction over the single-projector baseline.

02. Experts show linguistically meaningful specialization.

03. Maintains runtime efficiency comparable to the single-projector baseline.
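A relative WER reduction is measured against the baseline's own WER, not in absolute percentage points. The helper below is a minimal sketch of that arithmetic; the WER values in the example are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in %), chosen only to illustrate the arithmetic.
baseline, improved = 25.0, 23.1
print(f"{relative_wer_reduction(baseline, improved):.1f}%")  # 7.6%
```

In this made-up example, a 7.6% relative reduction corresponds to only 1.9 absolute WER points, so absolute gains depend on how high the baseline WER is for each language.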

Abstract

Recent advances in LLM-based ASR connect frozen speech encoders with large language models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). SMEAR-MoE delivers up to a 7.6% relative WER reduction over the single-projector baseline while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related…
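The abstract does not spell out the projector's internals. Assuming "SMEAR" refers to the Soft Merging of Experts with Adaptive Routing technique, in which a router's softmax probabilities are used to average the experts' parameters into one merged module, a projector of that kind could look like the NumPy forward-pass sketch below. The class and parameter names, per-utterance mean-pooled routing, and single-linear-layer experts are all illustrative assumptions, not the authors' implementation, and whatever stabilization the paper adds on top is not modeled.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class SmearStyleProjector:
    """Hypothetical SMEAR-style MoE projector (forward pass only).

    Assumptions: each expert is one linear map d_enc -> d_llm, and routing is
    decided once per utterance from mean-pooled frozen-encoder features.
    """

    def __init__(self, d_enc: int, d_llm: int, n_experts: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.router = rng.normal(0.0, 0.02, size=(d_enc, n_experts))
        self.W = rng.normal(0.0, 0.02, size=(n_experts, d_enc, d_llm))
        self.b = np.zeros((n_experts, d_llm))

    def route(self, feats: np.ndarray) -> np.ndarray:
        # feats: (T, d_enc) encoder outputs for one utterance.
        pooled = feats.mean(axis=0)            # (d_enc,)
        return softmax(pooled @ self.router)   # (n_experts,), every entry > 0

    def __call__(self, feats: np.ndarray):
        p = self.route(feats)
        # Soft merge: weight each expert's parameters by its router probability,
        # then run a single projection with the merged parameters.
        W_merged = np.tensordot(p, self.W, axes=1)  # (d_enc, d_llm)
        b_merged = p @ self.b                       # (d_llm,)
        return feats @ W_merged + b_merged, p


proj = SmearStyleProjector(d_enc=16, d_llm=32, n_experts=4)
feats = np.random.default_rng(1).normal(size=(50, 16))
out, p = proj(feats)
print(out.shape)  # (50, 32)
```

Because every router probability is strictly positive, each expert's parameters appear in the merged weights and receive gradient during training. Merging parameters also means only one projection runs per utterance instead of one per expert, which is consistent with, though not proof of, the comparable runtime the authors report. For purely linear experts, merging parameters gives exactly the same output as mixing the experts' outputs; for nonlinear experts (for example, small MLPs) the two differ.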

Code & Models

Models

🤗 bharatgenai/Shrutam-2 (model)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research