MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri

TL;DR
This paper introduces a large open-source speech dataset of 950,000 hours for EU languages, enabling the development of fully open-source speech foundation models with publicly available data, code, and model weights.
Contribution
It provides the first comprehensive open-source speech dataset for EU languages, including transcripts for unlabeled data, promoting transparent and accessible speech foundation model development.
Findings
Collected 950k hours of open-source speech data for EU languages
Released 441k hours of automatic transcripts under permissive license
Facilitated creation of fully open-source speech foundation models
Abstract
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.
Code & Models
- 🤗 nvidia/parakeet-tdt-0.6b-v3 (model) · 254k dl · ♡ 747
- 🤗 nvidia/parakeet-tdt-0.6b-v2 (model) · 164k dl · ♡ 1444
- 🤗 nvidia/canary-1b-v2 (model) · 123k dl · ♡ 371
- 🤗 SoSolaris/parakeet-tdt-0.6b-v3 (model) · 7 dl
- 🤗 ManuelZnnmc/parakeet-tdt-0.6b-v3 (model) · 1 dl
- 🤗 MadnessOverflow/parakeet-tdt-0.6b-v3-bpe-vocab (model)
- 🤗 Endy2001/parakeet-tdt-0.6b-v3 (model) · 3 dl
- 🤗 everyscribe/parakeet-tdt-0.6b-v3 (model) · 9 dl
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
