XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

TL;DR
XLS-R is a large-scale, self-supervised cross-lingual speech model trained on 128 languages, significantly advancing speech translation, recognition, and language identification across diverse languages and data regimes.
Contribution
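The paper introduces XLS-R, a cross-lingual speech model based on wav2vec 2.0 with up to 2B parameters, trained on nearly half a million hours of publicly available speech audio in 128 languages. It achieves state-of-the-art results in speech translation, speech recognition, and language identification.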
Findings
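Improves the previous state of the art on the CoVoST-2 speech translation benchmark by an average of 7.4 BLEU over 21 translation directions into English.
Lowers speech recognition error rates by 14-34% relative on average across BABEL, MLS, CommonVoice, and VoxPopuli.
Sets a new state of the art in language identification on VoxLingua107.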
Abstract
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size,…
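The "14-34% relative" figure in the abstract is a relative reduction, not an absolute drop in percentage points. A minimal sketch of the distinction follows; the function name `relative_reduction` and the two error rates are hypothetical values chosen for illustration and are not figures from the paper.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the error reduction as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical word error rates in percent (illustrative only, not from the paper).
baseline_wer = 20.0
new_wer = 15.0

# Absolute drop is measured in percentage points.
absolute_drop = baseline_wer - new_wer
# Relative drop is measured against the baseline error rate.
relative_drop = relative_reduction(baseline_wer, new_wer)

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.0%}")
# prints "absolute: 5.0 points, relative: 25%"
```

Under these assumed numbers, a 5-point absolute drop from a 20% baseline corresponds to a 25% relative reduction, which would fall inside the 14-34% range the abstract reports.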
Code & Models
- 🤗 espnet/xeus (model) · 33 dl · ♡ 146
- 🤗 aapot/wav2vec2-xlsr-1b-finnish-lm-v2 (model) · 11 dl · ♡ 3
- 🤗 aapot/wav2vec2-xlsr-1b-finnish-lm (model) · 1 dl
- 🤗 aapot/wav2vec2-xlsr-1b-finnish-v2 (model) · 18 dl
- 🤗 aapot/wav2vec2-xlsr-1b-finnish (model) · 17 dl
- 🤗 aapot/wav2vec2-xlsr-300m-finnish-lm (model) · 2 dl
- 🤗 aapot/wav2vec2-xlsr-300m-finnish (model) · 3 dl
- 🤗 bookbot/distil-wav2vec2-xls-r-adult-child-cls-64m (model) · 8 dl · ♡ 1
- 🤗 bookbot/distil-wav2vec2-xls-r-adult-child-cls-89m (model) · 1 dl
- 🤗 bookbot/wav2vec2-xls-r-adult-child-cls (model) · 310 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
