Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study
Hawau Olamide Toyin, Samar Mohamed Magdy, Hanan Aldarmaki

TL;DR
This study evaluates large language models for text diacritization in Arabic and Yoruba, introducing a new dataset and demonstrating that many LLMs outperform specialized models, with fine-tuning further improving results for Yoruba.
Contribution
The paper introduces MultiDiac, a novel multilingual dataset for Arabic and Yoruba diacritization, benchmarks 12 LLMs against specialized diacritization models, and fine-tunes four small open-source models with LoRA for Yoruba.
Findings
Many LLMs outperform specialized diacritization models.
Smaller models tend to hallucinate more.
Fine-tuning improves Yoruba diacritization performance.
Abstract
We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset, MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 12 LLMs varying in size, accessibility, and language coverage, and benchmark them against specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models, but smaller models suffer from hallucinations. We find that fine-tuning on a small dataset can help improve diacritization performance and reduce hallucinations for Yoruba.
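To make the task concrete, the sketch below shows one way to derive undiacritized input from diacritized text, plus a simple proxy check for hallucination: a diacritizer should add marks only, so stripping the marks from its output should give back the input. This is an illustrative assumption, not the paper's evaluation code or metric. The function names (strip_arabic, strip_yoruba, preserves_base_text) are hypothetical, and the Arabic mark set is limited to the core tashkeel range U+064B to U+0652.

```python
import unicodedata

# Assumption: core Arabic diacritics (tanween, fatha, damma, kasra, shadda, sukun).
ARABIC_TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_arabic(text: str) -> str:
    """Drop Arabic diacritic marks, leaving letters (including hamza forms) intact."""
    return "".join(ch for ch in text if ch not in ARABIC_TASHKEEL)


def strip_yoruba(text: str) -> str:
    """Drop Yoruba tone marks and underdots by removing combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def preserves_base_text(source: str, prediction: str, strip) -> bool:
    """True if the prediction differs from the source only in diacritics.

    A False result is a rough signal that the model inserted, dropped, or
    changed base characters (one possible symptom of hallucination).
    """
    return strip(prediction) == strip(source)
```

For example, preserves_base_text("ojo", model_output, strip_yoruba) flags any output whose base letters no longer spell "ojo". The Arabic function uses an explicit mark set rather than NFD decomposition because NFD would also split letters such as alef with hamza into a bare alef plus a combining mark, which would then be stripped.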
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
