ML-SUPERB: Multilingual Speech Universal PERformance Benchmark
Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei, Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee,, Shinji Watanabe

TL;DR
ML-SUPERB extends the SUPERB benchmark to 143 languages, evaluating multilingual speech processing models and revealing insights into their performance and limitations across diverse languages.
Contribution
This paper introduces ML-SUPERB, a multilingual benchmark for speech SSL models covering 143 languages, enabling comprehensive evaluation beyond English.
Findings
Speech SSL models outperform FBANK features in multilingual tasks.
Multilingual models do not always outperform monolingual models.
ML-SUPERB will be released with datasets and scripts for future research.
Abstract
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. However, SUPERB largely considers English speech in its evaluation. This paper presents multilingual SUPERB (ML-SUPERB), covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Following the concept of SUPERB, ML-SUPERB utilizes frozen SSL features and employs a simple framework for multilingual tasks by learning a shallow downstream model. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features. Furthermore, we find that multilingual models do not always perform better than their monolingual counterparts. We will release ML-SUPERB as a challenge with…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
