Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Rodolfo Zevallos; Luis Camacho; Nelsi Melgarejo

arXiv:2207.05498·cs.CL·July 13, 2022

Open Access

TL;DR

Huqariq is a multilingual speech corpus of native Peruvian languages, built to advance speech technology for language preservation. It was collected through crowdsourcing and contains 220 hours of transcribed speech recorded by more than 500 volunteers.

Contribution

This work introduces the Huqariq corpus, the largest collection of native Peruvian language speech data, and demonstrates its utility through speech recognition experiments.

Findings

01

220 hours of transcribed speech data collected

02

Corpus includes four native languages, with a goal of up to 20 of Peru's 48 native languages by the end of 2022

03

Speech recognition experiments validate corpus quality

Abstract

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification and text-to-speech tools. In order to achieve corpus collection sustainably, we employ the crowdsourcing methodology. Huqariq includes four native languages of Peru, and it is expected that by the end of the year 2022, it can reach up to 20 native languages out of the 48 native languages in Peru. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. In order to verify the quality of the corpus, we present speech recognition experiments using 220 hours of fully…

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems