Scaling Speech Technology to 1,000+ Languages
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

TL;DR
This paper presents a large-scale multilingual speech technology project that supports over 1,000 languages using new datasets and self-supervised learning, significantly expanding language coverage and improving recognition accuracy.
Contribution
It introduces a comprehensive approach to scaling speech models to over a thousand languages, including new datasets, models, and training techniques leveraging self-supervised learning.
Findings
The multilingual speech recognition model halves the word error rate of Whisper on 54 languages.
Pre-trained wav2vec 2.0 models cover 1,406 languages.
Speech synthesis and language identification models each cover over 1,000 languages.
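To make the word error rate (WER) finding concrete, here is a minimal sketch of the standard WER definition: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own evaluation may apply text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented for illustration only.
reference = "the cat sat on the mat"
baseline_output = "the cat sit on mat"   # one substitution, one deletion
improved_output = "the cat sat on mat"   # one deletion

baseline_wer = word_error_rate(reference, baseline_output)
improved_wer = word_error_rate(reference, improved_output)
print(f"baseline WER: {baseline_wer:.3f}, improved WER: {improved_wer:.3f}")
```

In this toy case the second system makes half as many word errors as the first against the same reference, which is what "halving the WER" means.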
Abstract
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than…
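The "10-40x, depending on the task" figure can be checked against the language counts in the abstract. The sketch below assumes a baseline of exactly 100 languages, which stands in for the abstract's "about one hundred", so the resulting factors are approximate. The names `BASELINE_LANGUAGES`, `mms_coverage`, and `expansion_factor` are illustrative.

```python
# Assumption: "about one hundred languages" is treated as exactly 100.
BASELINE_LANGUAGES = 100

# Language counts per task as stated in the abstract. Speech synthesis is
# described as covering "the same number of languages" as recognition.
mms_coverage = {
    "wav2vec 2.0 pre-training": 1406,
    "speech recognition": 1107,
    "speech synthesis": 1107,
    "language identification": 4017,
}


def expansion_factor(supported: int, baseline: int = BASELINE_LANGUAGES) -> float:
    """Return how many times more languages are covered than the baseline."""
    return supported / baseline


factors = {task: expansion_factor(count) for task, count in mms_coverage.items()}

for task, factor in factors.items():
    print(f"{task}: about {factor:.0f}x the baseline")
```

Under this assumption, recognition and synthesis sit near the low end of the range (about 11x) and language identification at the high end (about 40x).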
Code & Models
- 🤗 facebook/mms-tts (model) · ♡ 183
- 🤗 facebook/mms-tts-uzb-script_cyrillic (model) · 995 dl · ♡ 9
- 🤗 facebook/mms-tts-som (model) · 403 dl · ♡ 7
- 🤗 facebook/mms-tts-mya (model) · 982 dl · ♡ 8
- 🤗 espnet/xeus (model) · 33 dl · ♡ 146
- 🤗 BrianMwangi/African-Kikuyu-TTS (model) · 7.2k dl · ♡ 5
- 🤗 facebook/mms-300m (model) · 20k dl · ♡ 36
- 🤗 facebook/mms-1b (model) · 12k dl · ♡ 56
- 🤗 facebook/mms-1b-all (model) · 1.3M dl · ♡ 189
- 🤗 facebook/mms-1b-l1107 (model) · 558 dl · ♡ 11
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
