Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study

Hawau Olamide Toyin; Samar Mohamed Magdy; Hanan Aldarmaki

arXiv:2506.11602·cs.CL·March 18, 2026

TL;DR

This study evaluates large language models for text diacritization in Arabic and Yoruba, introducing a new dataset and demonstrating that many LLMs outperform specialized models, with fine-tuning further improving results for Yoruba.

Contribution

The paper introduces MultiDiac, a new multilingual dataset for diacritization, and benchmarks 12 LLMs against 4 specialized diacritization models, showing the potential of off-the-shelf LLMs and the benefit of LoRA fine-tuning for Yoruba.

Findings

01. Many LLMs outperform specialized diacritization models.
02. Smaller models tend to hallucinate more.
03. Fine-tuning improves Yoruba diacritization performance.

Abstract

We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset, MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 12 LLMs varying in size, accessibility, and language coverage, and benchmark them against 4 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models, but smaller models suffer from hallucinations. We find that fine-tuning on a small dataset can help improve diacritization performance and reduce hallucinations for Yoruba.
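To make the task and the hallucination issue concrete, here is a minimal Python sketch, not the authors' evaluation code. It assumes that "diacritics" can be approximated by Unicode combining marks (category Mn) and that one simple proxy for hallucination is a model output whose undiacritized text no longer matches the input. The function names `strip_diacritics` and `preserves_base_text` are hypothetical and do not come from the paper.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (Unicode category Mn) after NFD decomposition.

    Assumption: this is a coarse approximation. It removes Arabic harakat and
    Yoruba tone marks and underdots, but NFD also splits hamza/madda forms of
    alef, so the result is more aggressive than a linguist's definition.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def preserves_base_text(source: str, prediction: str) -> bool:
    """Return True if the prediction only adds or changes diacritics.

    A False result flags outputs that insert, drop, or rewrite base
    characters, which is one possible (assumed) proxy for hallucination.
    """
    return strip_diacritics(source) == strip_diacritics(prediction)


if __name__ == "__main__":
    # Yoruba: the undiacritized input "bata" and a diacritized candidate.
    print(preserves_base_text("bata", "bàtà"))
    # A candidate that appends extra words changes the base text.
    print(preserves_base_text("bata", "bàtà ni"))
```

A check like this only detects changes to the base characters; it says nothing about whether the restored diacritics are correct, which requires comparison against a gold diacritized reference.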

Code & Models

Datasets

herwoww/MultiDiac (dataset, 10 downloads)

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing