Master-ASR: Achieving Multilingual Scalability and Low-Resource Adaptation in ASR with Modular Learning
Zhongzhi Yu, Yang Zhang, Kaizhi Qian, Yonggan Fu, Yingyan Lin

TL;DR
Master-ASR introduces a modular learning framework that enhances multilingual scalability and low-resource adaptation in automatic speech recognition by learning a small set of sub-modules shared across languages and assembling them, outperforming state-of-the-art methods.
Contribution
It proposes a novel modular ASR framework that simultaneously improves multilingual scalability and low-resource adaptation through a learnable modularize-then-assemble strategy.
Findings
Achieves a 0.13 to 2.41 lower character error rate (CER) on multilingual ASR with 30% less inference overhead.
Requires nearly 50 times fewer trainable parameters in low-resource tuning.
Effectively discovers language similarities and enhances performance over SOTA methods.
Abstract
Despite the impressive performance recently achieved by automatic speech recognition (ASR), we observe two primary challenges that hinder its broader applications: (1) The difficulty of introducing scalability into the model to support more languages with limited training, inference, and storage overhead; (2) The low-resource adaptation ability that enables effective low-resource adaptation while avoiding over-fitting and catastrophic forgetting issues. Inspired by recent findings, we hypothesize that we can address the above challenges with modules widely shared across languages. To this end, we propose an ASR framework, dubbed Master-ASR, that, for the first time, simultaneously achieves strong multilingual scalability and low-resource adaptation ability thanks to its modularize-then-assemble strategy. Specifically, Master-ASR learns a small set of generalizable sub-modules and…
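The idea of a small pool of sub-modules shared across languages can be illustrated with a toy sketch. This is not the authors' implementation: the class name SharedModulePool, the softmax-weighted mixing, and the parameter counting are all assumptions made for illustration, since the abstract above is truncated before it describes how modules are assembled. The sketch only shows why adapting a new language can touch far fewer parameters when the shared modules stay frozen.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SharedModulePool:
    """Hypothetical pool of shared sub-modules (each a small linear map).

    Assumption: each language mixes the shared modules with its own
    learnable logits. The paper's actual assembly rule may differ.
    """

    def __init__(self, num_modules, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One dim x dim weight matrix per shared sub-module.
        self.weights = [
            [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_modules)
        ]
        # Per-language mixing logits, one value per shared sub-module.
        self.language_logits = {}

    def add_language(self, lang):
        """Register a language with uniform mixing weights."""
        self.language_logits[lang] = [0.0] * len(self.weights)

    def forward(self, lang, x):
        """Mix the outputs of all shared modules for one language."""
        probs = softmax(self.language_logits[lang])
        out = [0.0] * self.dim
        for p, w in zip(probs, self.weights):
            for i in range(self.dim):
                out[i] += p * sum(w[i][j] * x[j] for j in range(self.dim))
        return out

    def trainable_params(self, adapt_only):
        """Count parameters updated for one language.

        adapt_only=True freezes the shared modules and trains only
        the language's mixing logits.
        """
        shared = len(self.weights) * self.dim * self.dim
        per_language = len(self.weights)
        return per_language if adapt_only else shared + per_language


pool = SharedModulePool(num_modules=4, dim=8)
pool.add_language("new_lang")
y = pool.forward("new_lang", [1.0] * 8)
frozen_count = pool.trainable_params(adapt_only=True)
full_count = pool.trainable_params(adapt_only=False)
print(len(y), frozen_count, full_count)
```

In this toy setting, adapting to a new language with the shared modules frozen updates only the mixing logits, while full tuning also updates every shared weight matrix. The ratio here depends entirely on the made-up sizes and says nothing about the figures reported in the paper.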
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
