Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition
Wonjun Lee, Hyounghun Kim, Gary Geunbae Lee

TL;DR
This paper introduces Moe-Ctc, a Mixture-of-Experts model with intermediate CTC supervision. It improves accented speech recognition by promoting both expert specialization and generalization, and reduces WER on seen and unseen accents.
Contribution
The paper proposes Moe-Ctc, a novel Mixture-of-Experts architecture with accent-aware routing and intermediate CTC supervision for robust accented speech recognition.
Findings
Achieves up to 29.3% relative WER reduction on the Mcv-Accent benchmark.
Demonstrates improved performance on both seen and unseen accents.
Outperforms strong FastConformer baselines.
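For readers unfamiliar with the metric, relative WER reduction is the drop in word error rate expressed as a percentage of the baseline WER. The sketch below shows the arithmetic only. The function name and the two WER values are hypothetical, chosen so the result lands near the reported figure; they are not numbers from the paper.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates on the same scale
    (for example, both in percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper):
# a baseline at 10.0% WER and a model at 7.07% WER.
reduction = relative_wer_reduction(10.0, 7.07)
print(f"{reduction:.1f}% relative WER reduction")
```

A relative reduction depends on the baseline, so the same percentage can correspond to a small or large absolute change in WER.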
Abstract
Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the…
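One way to picture the transition from accent-aware to label-free routing is as a blend between a label-derived prior and the router's own probabilities, with the blend weight decaying over training. The sketch below is a minimal illustration under several assumptions that the abstract does not confirm: a one-expert-per-accent mapping, a linear decay schedule, and a routing-weighted sum of per-expert CTC losses. All function and parameter names are hypothetical, and the per-expert loss values are placeholders rather than outputs of a real CTC head.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blended_routing(router_logits, accent_id, step, total_steps):
    """Blend a one-hot accent prior with learned router probabilities.

    Assumption: alpha decays linearly from 1 (fully accent-aware) to 0
    (fully label-free). When no accent label is available, as at
    inference, only the learned router is used.
    """
    learned = softmax(router_logits)
    if accent_id is None:
        return learned
    alpha = max(0.0, 1.0 - step / total_steps)
    # Assumption: expert i is associated with accent i.
    prior = [1.0 if i == accent_id else 0.0 for i in range(len(router_logits))]
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(prior, learned)]


def routing_weighted_loss(weights, expert_ctc_losses):
    """Assumed combination: routing-weighted sum of per-expert CTC losses."""
    return sum(w * loss for w, loss in zip(weights, expert_ctc_losses))


# Early in training the accent label dominates the routing decision.
early = blended_routing([0.2, -0.1, 0.4], accent_id=1, step=0, total_steps=1000)
# At inference there is no label, so routing is purely learned.
infer = blended_routing([0.2, -0.1, 0.4], accent_id=None, step=0, total_steps=1000)
# Placeholder per-expert CTC loss values, for illustration only.
loss = routing_weighted_loss(early, [3.0, 2.0, 5.0])
```

Weighting each expert's CTC loss by its routing probability is one plausible way to tie routing to transcription quality, since experts that receive more weight contribute more to the objective. The paper's actual formulation may differ.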
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
