Towards Robust Speech Representation Learning for Thousands of Languages

William Chen; Wangyou Zhang; Yifan Peng; Xinjian Li; Jinchuan Tian; Jiatong Shi; Xuankai Chang; Soumi Maiti; Karen Livescu; Shinji Watanabe

arXiv:2407.00837·cs.CL·July 3, 2024

TL;DR

XEUS is a large-scale, multilingual speech representation model trained on over 1 million hours of data from 4057 languages, improving robustness and setting new state-of-the-art results across various speech benchmarks.

Contribution

The paper introduces XEUS, a cross-lingual SSL model trained on over 1 million hours of speech from 4057 languages. It augments the typical masked prediction approach with a new dereverberation objective to improve robustness, and extends the language coverage of SSL models 4-fold.
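This page does not say how the dereverberation objective is implemented. As a minimal sketch, assume it follows a common pattern: the encoder receives artificially reverberated audio while its prediction targets are derived from the clean recording. The function names, the toy exponentially decaying impulse response, and the `rt60` default below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def synthetic_rir(sample_rate=16000, rt60=0.4, seed=0):
    """Toy room impulse response: white noise with an exponential decay.

    Hypothetical helper; a real pipeline would likely use measured or
    simulated room responses instead.
    """
    rng = np.random.default_rng(seed)
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    # Amplitude falls by 60 dB (a factor of 1e-3) after rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct path
    return rir / np.linalg.norm(rir)

def reverberate(clean, rir):
    """Convolve clean audio with the impulse response and match its energy."""
    wet = np.convolve(clean, rir)[: len(clean)]
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(wet ** 2) + 1e-8))
    return wet * scale

def make_training_pair(clean, rir):
    """Return (encoder input, target source).

    Assumption: the encoder sees the reverberant signal, while the
    masked-prediction targets are computed from the clean signal.
    """
    return reverberate(clean, rir), clean

if __name__ == "__main__":
    clean = np.random.default_rng(1).standard_normal(16000)  # 1 s stand-in for speech
    noisy_input, target_source = make_training_pair(clean, synthetic_rir())
    print(noisy_input.shape, target_source.shape)
```

Under this assumption, the model can only predict the clean-derived targets by implicitly undoing the reverberation, which is one plausible reading of how such an objective could increase robustness.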

Findings

01 XEUS consistently outperforms or achieves comparable results to state-of-the-art SSL models on several benchmarks.

02 It achieves new state-of-the-art results on ML-SUPERB.

03 The model demonstrates robustness across diverse multilingual speech conditions.

Abstract

Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a…

Code & Models

Models

espnet/xeus (Hugging Face model; 33 downloads, 146 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems