Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset

Rawan Bondok; Mayar Nassar; Salam Khalifa; Kurt Micallef; Nizar Habash

arXiv:2505.02656·cs.CL·June 24, 2025

TL;DR

This paper introduces a new manually diacritized dataset of Arabic proper nouns from Wikipedia, benchmarks GPT-4o on recovering their full diacritization, and shows that the task remains difficult enough to call for better models and resources.

Contribution

It provides a new manually diacritized dataset of Arabic proper nouns of various origins, paired with their English Wikipedia equivalent glosses, and benchmarks GPT-4o's performance on this task.

Findings

01

GPT-4o achieves 73% accuracy at recovering full diacritization when given the undiacritized Arabic form and the English form

02

The task remains challenging, indicating room for model improvement

03

The dataset facilitates future research in Arabic proper noun diacritization

Abstract

Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun…
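
To make the benchmark setup more concrete, here is a minimal Python sketch of how an undiacritized input could be derived from a diacritized gold form and how predictions could be scored. It assumes that accuracy means exact match on the fully diacritized string, which the abstract does not spell out. The set of marks treated as diacritics, the function names, and the toy examples are all hypothetical and are not taken from the paper or its released dataset.

```python
import re
import unicodedata

# Assumption: "diacritics" means the common Arabic marks U+064B..U+0652
# (tanween, fatha, damma, kasra, shadda, sukun) plus dagger alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving only the bare letters."""
    return _DIACRITICS.sub("", text)


def _normalize(text: str) -> str:
    # NFC puts combining marks in canonical order, so "shadda + fatha"
    # and "fatha + shadda" compare as equal.
    return unicodedata.normalize("NFC", text.strip())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        _normalize(p) == _normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical toy data (escaped so the code points are unambiguous).
references = [
    "\u0644\u064E\u0646\u0652\u062F\u064E\u0646",        # "London", fully diacritized
    "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F",  # "Muhammad", fully diacritized
]
predictions = [
    "\u0644\u064E\u0646\u0652\u062F\u064E\u0646",        # matches the reference
    "\u0645\u064F\u062D\u064E\u0645\u064E\u062F",        # missing the shadda
]

undiacritized_inputs = [strip_diacritics(r) for r in references]
accuracy = exact_match_accuracy(predictions, references)
```

Under exact match, a single missing or wrong mark makes the whole name count as an error, which is one reason a stricter word-level score can look low even when most individual marks are right.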
