Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori

Rolando Coto-Solano; Daisy Li; Manoela Teleginski Ferraz; Olivia Sasse; Cha Krupka; Sharid Loáiciga; Sally Akevai Tenamu Nicholas

arXiv:2512.19630·cs.CL·December 23, 2025

Open Access

TL;DR

This study evaluates diacritic restoration methods for under-resourced languages. It shows that fine-tuned character-level LLMs outperform massively multilingual models, and that reliable performance begins at around 10,000 words of data.

Contribution

The paper compares algorithms for diacritic restoration in low-resource languages, analyzes how much data each approach needs, and finds that fine-tuned character-level LLMs perform best.

Findings

01

Fine-tuned character-level LLMs perform best.

02

Reliable diacritic restoration begins at ~10,000 words of data.

03

Zero-shot approaches perform poorly.

Abstract

We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data…
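To make the abstract's point about UTF-8 byte representations concrete, here is a minimal Python sketch. It is not the authors' pipeline: the helper names `strip_diacritics` and `utf8_bytes` are hypothetical, and treating "diacritics" as Unicode combining marks is an assumption that may not match the paper's definition (for example, it does not cover standalone characters such as a glottal stop letter). The sketch shows how a diacritic-free input can be derived from correctly spelled text, and how a single accented character looks to a byte-level model.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks from text (hypothetical helper).

    Assumption: a "diacritic" is any Unicode combining mark exposed by
    canonical decomposition (NFD). The paper may define it differently.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level model would see."""
    return list(text.encode("utf-8"))


# "M\u0101ori" is "Māori" written with the precomposed a-with-macron (U+0101).
target = "M\u0101ori"
source = strip_diacritics(target)  # diacritic-free input for a restoration model

print(source)                      # Maori
print(utf8_bytes("a"))             # [97]
print(utf8_bytes("\u0101"))        # [196, 129]: two bytes for the precomposed form
print(utf8_bytes(unicodedata.normalize("NFD", "\u0101")))  # [97, 204, 132]
```

In the decomposed (NFD) form, the base letter keeps its plain ASCII byte and the macron appears as two extra bytes, which is one way a byte-level model can treat restoration as inserting a small number of bytes after an unchanged base character. Whether the models in the paper see precomposed or decomposed input is not stated in the text above.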

Taxonomy

Topics: Natural Language Processing Techniques · Computational and Text Analysis Methods · Text Readability and Simplification