YODAS: Youtube-Oriented Dataset for Audio and Speech
Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe

TL;DR
YODAS is a large-scale, multilingual YouTube speech dataset with over 500,000 hours of data across 100+ languages, supporting supervised and self-supervised speech recognition research.
Contribution
This paper introduces YODAS, the first publicly available large-scale multilingual YouTube speech dataset, including collection methodology and baseline speech recognition results.
Findings
YODAS contains over 500,000 hours of speech data in 100+ languages.
The dataset supports both supervised and self-supervised learning approaches.
Baseline speech recognition models achieve promising results on the top-15 languages.
Abstract
In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, which include manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets are suited to self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We describe the collection methodology used for YODAS, which contributes to large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe speech recognition baselines for the top-15 languages.
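The split between labeled and unlabeled subsets described in the abstract can be illustrated with a minimal Python sketch. The record layout here (`Utterance`, `subtitle_source`, and the helper names) is hypothetical and chosen only for illustration. It is not the dataset's actual schema or the authors' tooling. The sketch assumes that any utterance with a manual or automatic subtitle is usable for supervised training and that everything else is routed to self-supervised use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Utterance:
    """Hypothetical record for one speech segment (not the real YODAS schema)."""
    video_id: str
    language: str
    duration_s: float
    subtitle: Optional[str]         # transcript text, or None if unlabeled
    subtitle_source: Optional[str]  # "manual", "automatic", or None


def route(utt: Utterance) -> str:
    """Return the training regime an utterance can support."""
    if utt.subtitle and utt.subtitle_source in {"manual", "automatic"}:
        return "supervised"
    return "self_supervised"


def hours_by_use(utts: Iterable[Utterance]) -> Dict[str, float]:
    """Sum audio hours per training regime."""
    totals = {"supervised": 0.0, "self_supervised": 0.0}
    for utt in utts:
        totals[route(utt)] += utt.duration_s / 3600.0
    return totals


if __name__ == "__main__":
    # Made-up sample values, used only to exercise the functions.
    sample = [
        Utterance("vid_a", "en", 1800.0, "hello world", "manual"),
        Utterance("vid_b", "ja", 3600.0, "konnichiwa", "automatic"),
        Utterance("vid_c", "es", 5400.0, None, None),
    ]
    print(hours_by_use(sample))
```

In practice, automatic subtitles are likely noisier than manual ones, so a real pipeline might track the two label sources separately rather than merging them as this sketch does.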
Code & Models
- 🤗 nvidia/parakeet-tdt-0.6b-v3 · model · 254k dl · ♡ 747
- 🤗 nvidia/parakeet-tdt-0.6b-v2 · model · 164k dl · ♡ 1444
- 🤗 nvidia/canary-1b-v2 · model · 123k dl · ♡ 371
- 🤗 SoSolaris/parakeet-tdt-0.6b-v3 · model · 7 dl
- 🤗 ManuelZnnmc/parakeet-tdt-0.6b-v3 · model · 1 dl
- 🤗 MadnessOverflow/parakeet-tdt-0.6b-v3-bpe-vocab · model
- 🤗 Endy2001/parakeet-tdt-0.6b-v3 · model · 3 dl
- 🤗 everyscribe/parakeet-tdt-0.6b-v3 · model · 9 dl
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
