Efficient Multilingual ASR Finetuning via LoRA Language Experts
Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian

TL;DR
This paper introduces a LoRA-based finetuning framework for multilingual ASR, built on Whisper, that reduces interference between languages. It achieves approximately 10% and 15% relative performance gains over standard fine-tuning methods in language-aware and language-agnostic scenarios, respectively.
Contribution
It presents an efficient multilingual ASR finetuning framework that combines prepared LoRA language experts through either expert fusion or knowledge distillation, addressing the curse of multilinguality.
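As a rough illustration of what fusing LoRA experts can mean, the sketch below adds a weighted sum of per-language low-rank updates to a frozen base weight. This is a generic construction, not the paper's method: the function names, the mixing weights, and the toy matrices are all assumptions made for illustration, and the summary does not say how the paper actually combines its experts.

```python
# Hypothetical sketch of LoRA expert fusion on a single weight matrix.
# All names and values are illustrative assumptions, not the paper's code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def fuse_lora_experts(w0, experts, weights, scale=1.0):
    """Return w0 + scale * sum_i weights[i] * (B_i @ A_i).

    w0      : frozen base weight, shape (d_out, d_in)
    experts : list of (B, A) pairs, B is (d_out, r) and A is (r, d_in)
    weights : one mixing coefficient per expert (assumed to be given)
    """
    fused = [row[:] for row in w0]  # copy so the base weight stays untouched
    for (b, a), g in zip(experts, weights):
        delta = matmul(b, a)  # low-rank update of this language expert
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                fused[i][j] += scale * g * v
    return fused

# Toy example: a 2x2 base weight and two rank-1 experts.
w0 = [[1.0, 0.0], [0.0, 1.0]]
expert_a = ([[1.0], [0.0]], [[0.0, 2.0]])  # B is 2x1, A is 1x2
expert_b = ([[0.0], [1.0]], [[4.0, 0.0]])
fused = fuse_lora_experts(w0, [expert_a, expert_b], weights=[0.5, 0.5])
print(fused)
```

With equal mixing weights each expert contributes half of its update. In a language-aware setting the weights could instead be chosen from a known language label, which is again an assumption rather than something stated in this summary.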
Findings
Achieves approximately 10% relative performance gain in language-aware scenarios
Achieves approximately 15% relative performance gain in language-agnostic scenarios
Demonstrates effectiveness on Whisper-based multilingual ASR models
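For readers unfamiliar with relative gains, the short sketch below shows how such a figure is typically computed from two error rates. It assumes an error-style metric where lower is better (for example word error rate); the summary does not name the metric, and the numbers used are made up for illustration only.

```python
def relative_gain(baseline_error, new_error):
    """Relative improvement of new_error over baseline_error (lower is better)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error

# Hypothetical numbers: an error rate dropping from 10.0 to 9.0
# corresponds to a 10% relative gain.
gain = relative_gain(10.0, 9.0)
print(f"{gain:.0%}")
```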
Abstract
Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and available large-scale multilingual datasets. Despite that, multilingual ASR still suffers from the curse of multilinguality in that different languages tend to interfere with each other, making it difficult for the ASR model to identify multiple languages effectively while sharing model capacity across them. This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper. Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods. Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance…
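The abstract names knowledge distillation as one of the two routes. As a generic illustration only, the sketch below computes a standard temperature-softened distillation loss between teacher and student logits. The abstract does not describe the paper's actual distillation objective, so the function names, the temperature value, and the use of a KL divergence are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the common textbook formulation, used here as an assumption;
    it is not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)  # teacher (e.g. a language expert)
    q = softmax(student_logits, temperature)  # student being finetuned
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(student, teacher)
print(loss)
```

The loss is zero when the student matches the teacher exactly and positive otherwise, so minimizing it pulls the student's output distribution toward the teacher's.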
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
