A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
P. Godard, G. Adda, M. Adda-Decker, J. Benjumea, L. Besacier, J. Cooper-Leavitt, G-N. Kouarata, L. Lamel, H. Maynard, M. Mueller, A. Rialland, S. Stueker, F. Yvon, M. Zanon-Boito

TL;DR
This paper introduces a speech corpus for Mboshi, an endangered language, designed to facilitate computational language documentation and zero-resource speech tasks, aiding linguists in analyzing unwritten languages.
Contribution
It provides a real-world, annotated speech dataset for Mboshi, enabling research in zero-resource speech processing and language documentation.
Findings
Dataset supports automatic phoneme and lexicon discovery
Effective for spoken term discovery experiments
Facilitates reproducible research in low-resource language processing
Abstract
Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
