Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain
Huda AlShuhayeb, Behrouz Minaei-Bidgoli, Mohammad E. Shenassa, Sayyed-Ali Hossayni

TL;DR
This paper introduces Noor-Ghateh, a comprehensive Arabic word segmentation dataset from Hadith texts, enabling better evaluation of segmentation tools in religious and literary Arabic contexts.
Contribution
It provides a large, expert-labeled dataset for Arabic word segmentation in the Hadith domain, surpassing existing datasets in volume and variety.
Findings
The dataset contains approximately 223,690 words.
Benchmarking with tools like Farasa, Camel, and ALP shows high annotation quality.
The dataset improves evaluation of morphological segmentation tools.
Abstract
Arabic has numerous rich and complex morphological features, which are highly useful when analyzing traditional Arabic texts, especially in literary and religious contexts, and which help in understanding the meaning of those texts. Word segmentation means separating a word into its components, such as the root and affixes. In morphological datasets, the variety of markers and the number of data samples help in evaluating morphological techniques. In this paper, we present a standard dataset for evaluating Arabic segmentation tools, which includes approximately 223,690 words from the "Shariat al-Islam" book, labeled by human experts. To the best of our knowledge, this dataset surpasses other Arabic Hadith datasets in volume and word variety. To evaluate the dataset, we applied different methods, including Farasa, Camel,…
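As an illustration of how a segmenter's output might be scored against expert labels such as these, the following is a minimal sketch of boundary-level precision, recall, and F1. The metric choice, the function names (`boundaries`, `boundary_prf`), and the toy words (written in Buckwalter transliteration) are assumptions made for this example; the excerpt above does not state which metric or data format the authors use.

```python
from typing import List, Set, Tuple


def boundaries(segments: List[str]) -> Set[int]:
    """Return the character offsets at which a word is split.

    ["w", "b", "Al", "ktAb"] -> {1, 2, 4}
    """
    cuts: Set[int] = set()
    pos = 0
    for seg in segments[:-1]:  # no boundary after the final segment
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_prf(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Boundary-level precision, recall, and F1 over aligned word lists."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)   # boundaries the tool got right
        fp += len(pb - gb)   # boundaries the tool invented
        fn += len(gb - pb)   # boundaries the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical toy data, not taken from the dataset.
gold = [["w", "b", "Al", "ktAb"], ["ktAb", "hm"], ["qAl"]]
pred = [["w", "bAl", "ktAb"], ["ktAbhm"], ["qA", "l"]]

p, r, f = boundary_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Scoring boundaries rather than whole words gives partial credit when a tool finds some but not all of a word's segments; a stricter alternative is to count a word as correct only when every segment matches.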
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies
