The Development of a Labelled te reo Māori-English Bilingual Database for Language Technology
Jesin James, Isabella Shields, Vithya Yogarajan, Peter J. Keegan, Catherine Watson, Peter-Lucas Jones, and Keoni Mahelona

TL;DR
This paper presents a large, annotated bilingual Māori-English database built from parliamentary debates, facilitating future language technology development for the under-resourced Māori language and demonstrating a methodology applicable to other low-resource languages.
Contribution
The creation of a comprehensive, word-level annotated Māori-English database using automated rules and manual annotation, along with analysis of its linguistic features.
Findings
The database contains 66,016,807 words, each with a word-level language annotation.
Analysis includes metadata, word frequency, sentence length, and N-grams.
Methodology can be applied to other low-resource language pairs.
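The kinds of analysis listed above (word frequency, sentence length, N-grams) can be sketched on a labelled corpus as follows. This is a minimal illustration only: the toy `sentences` data, the `"M"`/`"E"` label names, and the `ngrams` helper are assumptions made for the example, not the paper's data format or tooling.

```python
from collections import Counter

# Hypothetical toy sample: each sentence is a list of (word, label) pairs,
# with "M" for Māori and "E" for English. This is not data from the database.
sentences = [
    [("kia", "M"), ("ora", "M"), ("mr", "E"), ("speaker", "E")],
    [("kia", "M"), ("ora", "M"), ("koutou", "M")],
]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Word frequency across the whole sample.
word_freq = Counter(word for sent in sentences for word, _ in sent)

# Number of words carrying each language label.
label_freq = Counter(label for sent in sentences for _, label in sent)

# Sentence length measured in words.
sentence_lengths = [len(sent) for sent in sentences]

# Bigram frequency, counted within sentence boundaries only.
bigram_freq = Counter(
    bigram
    for sent in sentences
    for bigram in ngrams([word for word, _ in sent], 2)
)
```

Counting N-grams per sentence, rather than over one concatenated token stream, avoids creating spurious N-grams that span a sentence boundary.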
Abstract
Te reo Māori (referred to as Māori), New Zealand's indigenous language, is under-resourced in language technology. Māori speakers are bilingual, and Māori is code-switched with English. Unfortunately, minimal resources are available for Māori language technology, language detection, and code-switch detection for the Māori-English pair. Both English and Māori use Roman-derived orthography, which makes rule-based systems for detecting language and code-switching restrictive. Most Māori language detection is done manually by language experts. This research builds a Māori-English bilingual database of 66,016,807 words with word-level language annotation. The New Zealand Parliament Hansard debate reports were used to build the database. The language labels are assigned using language-specific rules and expert manual annotations. Words with the same spelling, but different…
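To illustrate why purely rule-based labelling is restrictive when both languages share a Roman-derived script, here is a minimal sketch of an orthographic check. It relies only on the general shape of Māori spelling (vowels a, e, i, o, u with optional macrons; consonants h, k, m, n, p, r, t, w plus the digraphs ng and wh; every syllable ending in a vowel). The function `label_word`, the label names, and the small `AMBIGUOUS` set are hypothetical and are not the rules used in the paper.

```python
import re

# One Māori syllable: an optional consonant (or digraph) followed by vowels.
_SYLLABLE = r"(?:ng|wh|[hkmnprtw])?[aeiouāēīōū]+"
_MAORI_SHAPE = re.compile(rf"(?:{_SYLLABLE})+")

# Hypothetical, illustrative list of spellings that are words in both
# languages; a real list would need expert curation.
AMBIGUOUS = {"he", "me", "a", "i", "one", "hoe"}


def label_word(word):
    """Assign a coarse label from spelling alone (assumed label names)."""
    w = word.lower()
    if not _MAORI_SHAPE.fullmatch(w):
        return "E"  # cannot be spelled this way in Māori
    if w in AMBIGUOUS:
        return "AMBIGUOUS"  # needs context or manual annotation
    return "M"  # Māori-shaped; may still be wrong for some English words
```

A word such as `speaker` fails the shape check and is labelled `"E"`, while `whānau` passes and is labelled `"M"`. The English word `potato` also passes the shape check, so spelling rules alone mislabel it, which is the kind of case where context or expert annotation is needed.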
Taxonomy
Topics: Linguistic Variation and Morphology · Multilingual Education and Policy
