Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

TL;DR
This paper shows that training a single multilingual ASR model on more than 50 languages and 16,000 hours of audio improves recognition accuracy over monolingual baselines, especially for low-resource languages, while also simplifying deployment.
Contribution
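It introduces the first large-scale multilingual ASR benchmark with over 50 languages and compares three training variants (a joint model, a joint model with language input, and a multi-head model with one head per language cluster) to improve low-resource language recognition.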
Findings
Multilingual training reduces WER by 20.9% to 28.8% on average (relative to monolingual baselines), depending on the training variant.
Using language input or multiple heads improves performance over a joint model.
Massive-scale multilingual ASR outperforms monolingual baselines.
Abstract
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and overall simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data by language (from 100 hours to 1100 hours). We compare three variants of multilingual training from a single joint model without knowing the input language, to using this information, to multiple heads (one per language cluster). We show that multilingual training of ASR models on several languages can improve recognition performance, in particular, on low-resource languages. We see 20.9%, 23% and 28.8% average WER relative reduction compared to monolingual baselines on joint model, joint model with language input and multi head model respectively. To our…
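The "average WER relative reduction" quoted in the abstract can be illustrated with a small sketch. The WER values below are hypothetical, not results from the paper, and averaging the per-language relative reductions with equal weight is an assumption; the summary does not state how the authors aggregate across languages.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - model) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


def average_relative_reduction(baseline: dict, multilingual: dict) -> float:
    """Mean of per-language relative reductions (assumed: equal weight per language)."""
    langs = sorted(baseline)
    total = sum(relative_wer_reduction(baseline[l], multilingual[l]) for l in langs)
    return total / len(langs)


# Hypothetical per-language WERs (percent); language names are placeholders.
baseline = {"lang_a": 40.0, "lang_b": 20.0}      # monolingual models
multilingual = {"lang_a": 30.0, "lang_b": 18.0}  # single multilingual model

avg = average_relative_reduction(baseline, multilingual)
print(f"average relative WER reduction: {avg:.1f}%")
```

A relative reduction is measured against the baseline WER, so going from 40% to 30% WER is a 25% relative reduction, not a 10% one.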
