WorldSpeech: A Multilingual Speech Corpus from Around the World
Antonis Asonitis, Luca A. Lanzendörfer, Frédéric Berdoz, Roger Wattenhofer

TL;DR
WorldSpeech is a large, multilingual speech corpus with 65,000 hours of data across 76 languages, designed to improve ASR performance for low-resource languages.
Contribution
The paper introduces WorldSpeech, a 24 kHz multilingual corpus of aligned audio-transcript data collected from diverse public sources, and shows that fine-tuning existing ASR models on it substantially improves accuracy for underrepresented languages.
Findings
Fine-tuning existing ASR models on WorldSpeech yields an average relative Word-Error-Rate reduction of 63.5% across 11 typologically diverse languages.
WorldSpeech covers 76 languages; 37 of them have more than 200 hours of aligned speech, 28 exceed 500 hours, and 24 surpass 1k hours.
The data is collected from public sources: parliamentary proceedings, international broadcasts, and public-domain audiobooks.
Abstract
Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most languages due to limited publicly available aligned data. To this end, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech results in an average relative Word-Error-Rate reduction of 63.5% across 11 typologically diverse languages.
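The headline metric is a relative reduction, which is easy to confuse with an absolute drop in WER. The sketch below shows one common way to compute word-level WER and an average relative reduction across languages. It is an illustration only, not the authors' evaluation code: the function names, the whitespace tokenization, the lack of text normalization, and the per-language averaging are all assumptions made here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer


def mean_relative_reduction(wers_by_language: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean of per-language relative reductions.

    Maps a language code to (baseline WER, fine-tuned WER). Averaging per
    language rather than pooling utterances is an assumption of this sketch.
    """
    if not wers_by_language:
        raise ValueError("need at least one language")
    reductions = [
        relative_wer_reduction(baseline, finetuned)
        for baseline, finetuned in wers_by_language.values()
    ]
    return sum(reductions) / len(reductions)
```

Under this definition, a model that goes from 40% to 10% WER has a relative reduction of 75%, even though the absolute drop is 30 percentage points. These numbers are hypothetical and are not taken from the paper.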
