XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Arun Babu; Changhan Wang; Andros Tjandra; Kushal Lakhotia; Qiantong Xu; Naman Goyal; Kritika Singh; Patrick von Platen; Yatharth Saraf; Juan Pino; Alexei Baevski; Alexis Conneau; Michael Auli

arXiv:2111.09296 · cs.CL · December 17, 2021

TL;DR

XLS-R is a large-scale, self-supervised cross-lingual speech model based on wav2vec 2.0 and trained on speech audio in 128 languages. It improves on prior work in speech translation, speech recognition, and language identification across high- and low-resource languages.

Contribution

The paper introduces XLS-R, a set of cross-lingual speech models with up to 2B parameters trained on nearly half a million hours of publicly available speech audio, and reports state-of-the-art results on several speech tasks across many languages.

Findings

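01. Improves the previous state of the art on the CoVoST-2 speech translation benchmark by an average of 7.4 BLEU over 21 translation directions into English.

02. Lowers speech recognition error rates by 14-34% relative on average compared with the best known prior work on BABEL, MLS, CommonVoice, and VoxPopuli.

03. Sets a new state of the art in language identification on VoxLingua107.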
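The speech recognition improvement above is a relative reduction (the abstract says "14-34% relative on average"), not a drop in absolute percentage points. The two are easy to confuse, so here is a minimal sketch of the difference. The error rates are made-up illustrative values, not results from the paper, and the function name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error-rate reduction as a percentage of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only; NOT taken from the paper.
baseline = 20.0  # word error rate of a prior system, in percent
improved = 15.0  # word error rate of a newer system, in percent

absolute = baseline - improved                     # difference in percentage points
relative = relative_reduction(baseline, improved)  # difference as a share of the baseline

print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 5.0 points, relative: 25.0%"
```

With these assumed values, a 5-point absolute drop corresponds to a 25% relative reduction. The same absolute drop from a lower baseline would be a larger relative reduction.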

Abstract

This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size,…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing