Towards Robust Speech Representation Learning for Thousands of Languages
William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

TL;DR
XEUS is a large-scale, multilingual speech representation model trained on over 1 million hours of data from 4057 languages, improving robustness and setting new state-of-the-art results across various speech benchmarks.
Contribution
The paper introduces XEUS, a cross-lingual SSL model trained on over 1 million hours of speech spanning 4057 languages. It adds a novel dereverberation objective to the usual masked prediction training to improve robustness, and extends the language coverage of SSL models 4-fold.
Findings
XEUS outperforms existing SSL models on multiple benchmarks.
It achieves new state-of-the-art results on ML-SUPERB.
The model demonstrates robustness across diverse multilingual speech conditions.
Abstract
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a…
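The abstract does not spell out how the dereverberation objective is implemented. One plausible reading, assumed here, is that the encoder receives artificially reverberated audio while its masked-prediction targets are still derived from the clean recording. The sketch below only builds such an (input, target) pair; the function names, the synthetic impulse response, and the `rt60` value are hypothetical illustrations, not details taken from the paper.

```python
import numpy as np


def synthetic_rir(sample_rate=16000, rt60=0.4, seed=0):
    """Hypothetical room impulse response: exponentially decaying noise.

    This is a stand-in for a measured or simulated impulse response,
    not the augmentation used by the authors.
    """
    rng = np.random.default_rng(seed)
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that falls by 60 dB over rt60 seconds.
    envelope = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir / np.linalg.norm(rir)


def make_dereverb_pair(clean, rir):
    """Return (reverberant_input, clean_target), both the length of `clean`."""
    # Convolving with the impulse response adds reverberation;
    # truncating keeps the input aligned with the clean target.
    reverberant = np.convolve(clean, rir)[: len(clean)]
    return reverberant, clean


# Toy one-second "utterance" of random samples at 16 kHz.
clean = np.random.default_rng(1).standard_normal(16000)
rir = synthetic_rir()
model_input, target_source = make_dereverb_pair(clean, rir)
```

Under this assumption, `model_input` would be masked and fed to the encoder, while the prediction targets would be computed from `target_source`, so the model has to implicitly undo the reverberation to predict them.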
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
