Speech Wikimedia: A 77 Language Multilingual Speech Dataset

Rafael Mosquera Gómez; Julián Eusse; Juan Ciro; Daniel Galvez; Ryan Hileman; Kurt Bollacker; David Kanter

arXiv:2308.15710·cs.AI·August 31, 2023

TL;DR

Speech Wikimedia is a multilingual dataset of transcribed audio in 77 languages, extracted from Wikimedia Commons and suitable for training speech recognition, speech translation, and machine translation models.

Contribution

The paper introduces a publicly available, CC-BY-SA licensed multilingual speech dataset of 1780 hours (195 GB) that covers a diverse set of scenarios and speakers and is intended for training speech and translation models.

Findings

01. Supports training of multilingual speech recognition models

02. Facilitates research in speech translation across 77 languages

03. Covers a diverse set of scenarios and speakers

Abstract

The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.
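
To illustrate why one audio file with transcriptions in several languages can serve all three tasks named in the abstract, here is a minimal Python sketch. The `Record` layout, its field names, and the `derive_examples` helper are hypothetical and chosen only for illustration; the actual schema of the released dataset may differ, and the sketch assumes the spoken language of each audio file is known.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class Record:
    """Hypothetical record layout (an assumption, not the dataset's real schema)."""
    audio_path: str
    audio_lang: str              # language spoken in the audio (assumed to be known)
    transcripts: Dict[str, str]  # language code -> transcription text


def derive_examples(rec: Record) -> Dict[str, List[Tuple]]:
    """Split one record into speech recognition, speech translation, and
    machine translation training examples."""
    asr, st, mt = [], [], []
    for lang, text in rec.transcripts.items():
        if lang == rec.audio_lang:
            # Audio and text share a language: speech recognition example.
            asr.append((rec.audio_path, lang, text))
        else:
            # Audio in one language, text in another: speech translation example.
            st.append((rec.audio_path, rec.audio_lang, lang, text))
    # Every ordered pair of transcriptions is a text-to-text translation example.
    for src, tgt in permutations(rec.transcripts, 2):
        mt.append((src, tgt, rec.transcripts[src], rec.transcripts[tgt]))
    return {"asr": asr, "st": st, "mt": mt}


# Made-up example record, not taken from the dataset.
example = Record(
    audio_path="clip_0001.ogg",
    audio_lang="en",
    transcripts={"en": "Hello world.", "es": "Hola mundo.", "fr": "Bonjour le monde."},
)
examples = derive_examples(example)
```

Under these assumptions, a file with three transcriptions yields one speech recognition example, two speech translation examples, and six ordered machine translation pairs, so files with more transcriptions contribute disproportionately to the translation tasks.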

Code & Models

Repositories

https://huggingface.co/datasets/MLCommons/speech-wikimedia (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing