Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori
Rolando Coto-Solano, Daisy Li, Manoela Teleginski Ferraz, Olivia Sasse, Cha Krupka, Sharid Loáiciga, Sally Akevai Tenamu Nicholas

TL;DR
This study evaluates diacritic restoration methods for under-resourced languages, showing fine-tuned character-level LLMs outperform multilingual models, with effective performance starting at around 10,000 words of data.
Contribution
It compares algorithms for diacritic restoration in low-resource languages, analyzing data requirements and model effectiveness, highlighting the superiority of fine-tuned character-level LLMs.
Findings
Fine-tuned character-level LLMs perform best.
Reliable diacritic restoration begins at ~10,000 words of data.
Zero-shot approaches perform poorly.
Abstract
We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritic restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data…
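To make the byte-decomposition point concrete, here is a minimal sketch in Python using only the standard library. It is not the authors' pipeline: the helper names `strip_diacritics` and `utf8_bytes` are hypothetical, and the example word is chosen purely for illustration. The sketch shows how a diacritic-free input can be derived from correctly marked text (a common way to build training pairs for restoration), and how a single accented character becomes a short sequence of UTF-8 bytes that a byte-level model would see.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (hypothetical helper, not from the paper)."""
    # NFD splits a precomposed character such as "ā" into "a" + U+0304.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a nonzero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def utf8_bytes(text: str, form: str = "NFC") -> list[int]:
    """Return the UTF-8 byte values of text after Unicode normalization."""
    return list(unicodedata.normalize(form, text).encode("utf-8"))


word = "Māori"
# (input without diacritics, target with diacritics)
pair = (strip_diacritics(word), word)

print(pair)
print(utf8_bytes("ā"))         # precomposed form: two bytes
print(utf8_bytes("ā", "NFD"))  # base letter plus combining macron: three bytes
```

Note that this stripping approach only removes combining marks; characters that are separate letters rather than diacritics (for example, a glottal stop symbol) pass through unchanged, so a real preprocessing step would need language-specific handling.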
Taxonomy
Topics: Natural Language Processing Techniques · Computational and Text Analysis Methods · Text Readability and Simplification
