Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

TL;DR
The paper introduces Omni-router, a shared routing mechanism for sparse MoE models in speech recognition, leading to better expert cooperation, lower error rates, and enhanced robustness across diverse datasets.
Contribution
It proposes a shared router across MoE layers to improve expert collaboration and specialization in speech recognition models.
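To make the contrast with per-layer routing concrete, here is a minimal NumPy sketch of a stack of top-1 MoE feed-forward layers that can run with either one router matrix shared by all layers or a separate router per layer. It is an illustration under stated assumptions, not the authors' implementation: the dimensions, the ReLU expert networks, the residual connection, and all names (`top1_route`, `moe_layer`, `forward`) are hypothetical. The sketch also assumes the shared router weights are applied to each layer's own input, which the summary does not spell out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top1_route(x, w_router):
    """Return (expert index, gate probability) per frame for top-1 routing."""
    probs = softmax(x @ w_router)             # (frames, n_experts)
    idx = probs.argmax(axis=-1)
    gate = probs[np.arange(len(idx)), idx]
    return idx, gate

def moe_layer(x, w_router, experts):
    """Top-1 MoE feed-forward layer with a residual connection (assumed design)."""
    idx, gate = top1_route(x, w_router)
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = idx == e
        if mask.any():
            h = np.maximum(x[mask] @ w_in, 0.0)         # ReLU expert FFN
            out[mask] = gate[mask, None] * (h @ w_out)  # scale by gate probability
    return x + out, idx

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_layers, n_frames = 16, 32, 4, 3, 10

def make_experts():
    return [(rng.normal(scale=0.1, size=(d_model, d_ff)),
             rng.normal(scale=0.1, size=(d_ff, d_model)))
            for _ in range(n_experts)]

layers = [make_experts() for _ in range(n_layers)]

# Shared routing: ONE router matrix reused by every layer.
shared_router = rng.normal(size=(d_model, n_experts))
# Switch-style baseline: an independent router matrix per layer.
independent_routers = [rng.normal(size=(d_model, n_experts))
                       for _ in range(n_layers)]

def forward(x, routers):
    """Run the layer stack; return outputs and per-layer expert choices."""
    choices = []
    for experts, w_router in zip(layers, routers):
        x, idx = moe_layer(x, w_router, experts)
        choices.append(idx)
    return x, np.stack(choices)               # choices: (n_layers, frames)

x = rng.normal(size=(n_frames, d_model))
y_shared, choices_shared = forward(x, [shared_router] * n_layers)
y_indep, choices_indep = forward(x, independent_routers)
```

In this sketch the shared variant stores a single `d_model x n_experts` router instead of one per layer. Because each layer sees a different hidden state, the shared weights do not force identical expert choices across layers; they only tie the routing function itself.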
Findings
Reduces average word error rate by 11.2% compared with dense models.
Reduces average word error rate by 8.2% compared with the Switch Transformer.
Demonstrates improved robustness across 10 out-of-domain benchmarks.
Abstract
Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%,…
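The abstract's observation that routers in different layers make weakly correlated choices can be checked with any agreement measure over per-frame expert assignments. The sketch below uses normalized mutual information (NMI) between two layers' top-1 choices. This is an assumed measure chosen for illustration; the summary does not say which statistic the authors used, and the function names are hypothetical.

```python
import numpy as np

def contingency(a, b, n_experts):
    """Count how often expert i in one layer co-occurs with expert j in another."""
    table = np.zeros((n_experts, n_experts))
    np.add.at(table, (np.asarray(a), np.asarray(b)), 1)
    return table

def normalized_mutual_info(a, b, n_experts):
    """NMI in [0, 1] between two layers' per-frame expert assignments."""
    p = contingency(a, b, n_experts) / len(a)   # joint distribution
    pa, pb = p.sum(axis=1), p.sum(axis=0)       # marginals per layer
    nz = p > 0                                  # skip empty cells (0 * log 0 = 0)
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    denom = np.sqrt(ha * hb)
    return float(mi / denom) if denom > 0 else 0.0

# Toy assignments (made up for illustration, not data from the paper).
layer_a = np.array([0, 1, 2, 3, 0, 1, 2, 3])
layer_b = (layer_a + 1) % 4                     # same grouping, relabeled experts
nmi_relabelled = normalized_mutual_info(layer_a, layer_b, n_experts=4)

unrelated_a = np.array([0, 0, 1, 1])
unrelated_b = np.array([0, 1, 0, 1])            # statistically independent choices
nmi_unrelated = normalized_mutual_info(unrelated_a, unrelated_b, n_experts=2)
```

NMI is invariant to how experts are numbered, which matters here because expert indices in independently routed layers carry no shared meaning. A value near 1 means one layer's choice largely determines the other's, and a value near 0 means the choices are close to independent.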
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Mobile Crowdsensing and Crowdsourcing
