Scaling Speech Technology to 1,000+ Languages

Vineel Pratap; Andros Tjandra; Bowen Shi; Paden Tomasello; Arun Babu; Sayani Kundu; Ali Elkahky; Zhaoheng Ni; Apoorv Vyas; Maryam Fazel-Zarandi; Alexei Baevski; Yossi Adi; Xiaohui Zhang; Wei-Ning Hsu; Alexis Conneau; Michael Auli

arXiv:2305.13516 · cs.CL · May 24, 2023 · 115 cites

Open Access · 4 Repos · 10 Models · 3 Datasets

TL;DR

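This paper presents the Massively Multilingual Speech (MMS) project, which extends speech recognition, speech synthesis, and language identification to over 1,000 languages using a new dataset of religious-text readings and self-supervised learning. It increases language coverage by 10-40x, depending on the task, and improves recognition accuracy over Whisper on 54 languages.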

Contribution

It introduces a comprehensive approach to scaling speech models to over a thousand languages, including new datasets, models, and training techniques leveraging self-supervised learning.

Findings

01. The multilingual speech recognition model halves the word error rate of Whisper on 54 languages.

02. The pre-trained wav2vec 2.0 models cover 1,406 languages.

03. The project built speech synthesis models for 1,107 languages and a language identification model for 4,017 languages.
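The first finding is stated in terms of word error rate (WER). As a minimal sketch of the standard definition (substitutions, deletions, and insertions divided by the number of reference words), the function below computes WER with a word-level edit distance. The function name and the example sentences are illustrative assumptions, and this is not the paper's evaluation code, which may apply text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, made up for illustration only.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {example_wer:.3f}")
```

Under this definition, "halving" WER means that the total number of word-level errors per reference word drops by half on the same test set.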

Abstract

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than…
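The "10-40x" figure can be checked against the language counts quoted in the abstract. The sketch below is only a back-of-the-envelope calculation: it assumes a baseline of exactly 100 languages, standing in for the abstract's "about one hundred", so the resulting multipliers are approximate. The variable and function names are illustrative.

```python
# Assumed baseline: the abstract says "about one hundred languages".
BASELINE_LANGUAGES = 100

# Language counts per task, as quoted in the abstract.
mms_coverage = {
    "wav2vec 2.0 pre-training": 1406,
    "speech recognition": 1107,
    "speech synthesis": 1107,
    "language identification": 4017,
}


def coverage_multiplier(supported: int, baseline: int = BASELINE_LANGUAGES) -> float:
    """Return how many times more languages are supported than the baseline."""
    return supported / baseline


multipliers = {task: coverage_multiplier(n) for task, n in mms_coverage.items()}

for task, factor in multipliers.items():
    print(f"{task}: about {factor:.0f}x the baseline")
```

With this baseline, speech recognition and synthesis land at roughly 11x and language identification at roughly 40x, which is consistent with the stated range.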

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques