A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR
Yuang Zheng, Dongxu Chen, Yuxiang Mei, Dongxing Xu, Jie Chen, Yanhua Long

TL;DR
This paper introduces a lightweight, language-agnostic hierarchical LoRA-MoE architecture for multilingual ASR. It improves decoding efficiency and removes the need for prior language information, which makes it suitable for resource-constrained devices.
Contribution
It presents a novel hierarchical LoRA-MoE framework integrated into an mHuBERT-CTC model, enabling true language-agnostic decoding without explicit language labels.
Findings
Achieves comparable performance to two-stage inference methods.
Reduces real-time factor (RTF) by 11.7% and 8.2%.
Demonstrates effectiveness on MSR-86K and MLC-SLM datasets.
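For readers unfamiliar with the metric, real-time factor is conventionally the ratio of processing time to audio duration, so lower is better. The sketch below shows how a relative RTF reduction such as those listed above could be computed. The function names and the timing values are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline_rtf: float, new_rtf: float) -> float:
    """Fractional reduction of new_rtf relative to baseline_rtf."""
    if baseline_rtf <= 0:
        raise ValueError("baseline_rtf must be positive")
    return (baseline_rtf - new_rtf) / baseline_rtf


# Hypothetical timings (not from the paper): 100 s of audio,
# decoded in 10 s by a baseline and in 8.83 s by a faster system.
baseline = real_time_factor(10.0, 100.0)
faster = real_time_factor(8.83, 100.0)
print(f"relative RTF reduction: {relative_reduction(baseline, faster):.1%}")
```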
Abstract
Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic…
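The hierarchical design described in the abstract can be illustrated with a minimal NumPy sketch: a frozen base projection, one shared LoRA adapter applied to every input, and per-language LoRA experts combined according to a language-identification (LID) posterior. This is not the authors' implementation. The class and function names, the soft (weighted-sum) combination of experts, the utterance-level LID logits, and all dimensions are assumptions made for illustration; the paper may, for example, select a single expert instead.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LoRA:
    """Low-rank adapter: delta(x) = x @ A @ B (hypothetical helper)."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = rng.normal(scale=0.01, size=(d_in, rank))
        # Zero init so a freshly created adapter is a no-op.
        self.B = np.zeros((rank, d_out))

    def __call__(self, x):
        return x @ self.A @ self.B


def hlora_layer(x, W, shared, experts, lid_logits):
    """Apply base weights, the shared LoRA, and LID-weighted expert LoRAs.

    x: (T, d_in) frame features; W: (d_in, d_out) frozen base weights;
    lid_logits: (num_languages,) utterance-level LID scores (assumed).
    """
    weights = softmax(lid_logits)          # LID posterior over languages
    out = x @ W + shared(x)                # language-invariant path
    for w, expert in zip(weights, experts):
        out = out + w * expert(x)          # language-dependent path (soft routing)
    return out, weights


# Small demo with made-up sizes: 5 frames, 8 -> 6 dims, 3 languages.
demo_rng = np.random.default_rng(0)
demo_x = demo_rng.normal(size=(5, 8))
demo_W = demo_rng.normal(size=(8, 6))
demo_shared = LoRA(8, 6, rank=2, rng=demo_rng)
demo_experts = [LoRA(8, 6, rank=2, rng=demo_rng) for _ in range(3)]
demo_out, demo_weights = hlora_layer(
    demo_x, demo_W, demo_shared, demo_experts, np.array([0.1, 2.0, -1.0])
)
print(demo_out.shape, demo_weights.round(3))
```

Because the routing weights come from the model's own LID posterior rather than from a supplied language label, no language identity has to be provided at inference time, which is the property the abstract calls language-agnostic decoding.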
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Domain Adaptation and Few-Shot Learning
