Speech Wikimedia: A 77 Language Multilingual Speech Dataset
Rafael Mosquera Gómez, Julián Eusse, Juan Ciro, Daniel Galvez, Ryan Hileman, Kurt Bollacker, David Kanter

TL;DR
Speech Wikimedia is a publicly available dataset of 1780 hours of CC-BY-SA licensed transcribed audio from Wikimedia Commons in 77 languages, intended for training speech recognition, speech translation, and machine translation models.
Contribution
The paper introduces a publicly available multilingual speech dataset covering diverse scenarios and speakers, in which each audio file carries one or more transcriptions in different languages.
Findings
Suitable for training multilingual speech recognition models
Suitable for training speech translation and machine translation models, since an audio file can have transcriptions in more than one language
Covers a diverse set of scenarios and speakers across 77 languages
Abstract
The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.
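As a rough illustration of why one audio file with transcriptions in several languages can serve all three tasks, the sketch below derives training pairs from a single record. The `Clip` class, its field names (including `spoken_lang`), and the helper functions are hypothetical and are not the dataset's actual schema; the sketch also assumes the transcriptions of one file are translations of each other.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class Clip:
    """Hypothetical record: one audio file plus its transcriptions keyed by language code."""
    audio_path: str
    spoken_lang: str  # assumed field: language spoken in the audio
    transcripts: dict = field(default_factory=dict)


def asr_pairs(clip):
    """(audio, text) pairs where the transcript language matches the spoken language."""
    text = clip.transcripts.get(clip.spoken_lang)
    return [(clip.audio_path, text)] if text is not None else []


def speech_translation_pairs(clip):
    """(audio, target_lang, text) triples for transcripts in any other language."""
    return [
        (clip.audio_path, lang, text)
        for lang, text in clip.transcripts.items()
        if lang != clip.spoken_lang
    ]


def mt_pairs(clip):
    """(src_lang, tgt_lang, src_text, tgt_text) for every ordered pair of transcripts."""
    return [
        (a, b, clip.transcripts[a], clip.transcripts[b])
        for a, b in permutations(clip.transcripts, 2)
    ]


# Made-up example record, for illustration only.
clip = Clip("example.ogg", "en", {"en": "hello", "es": "hola"})
print(asr_pairs(clip))
print(speech_translation_pairs(clip))
print(mt_pairs(clip))
```

A file with a single transcription yields at most one speech recognition pair and no translation pairs, so the share of the 1780 hours usable for each task depends on how many files have multiple transcriptions.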
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
