QuranMorph: Morphologically Annotated Quranic Corpus

Diyam Akra; Tymaa Hammouda; Mustafa Jarrar

arXiv:2506.18148·cs.CL·June 24, 2025

TL;DR

QuranMorph is a manually annotated corpus of the Quran (77,429 tokens) in which every token is lemmatized and POS-tagged, allowing it to be inter-linked with other linguistic resources.

Contribution

The paper introduces a publicly available, manually annotated Quranic corpus whose lemmas and fine-grained POS tags link it to other linguistic resources.

Findings

01 Manual lemmatization and POS tagging of all 77,429 tokens by three expert linguists

02 Fine-grained POS tagging with the 40-tag SAMA/Qabas tagset

03 Inter-linking with other linguistic resources through lemmas taken from Qabas

Abstract

We present the QuranMorph corpus, a morphologically annotated corpus for the Quran (77,429 tokens). Each token in QuranMorph was manually lemmatized and tagged with its part of speech by three expert linguists. The lemmatization process used lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be inter-linked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at https://sina.birzeit.edu/quran.
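As a rough illustration of the annotation scheme described above (one lemma and one POS tag per token), the following Python sketch models a token record and shows how a shared lemma can serve as a join key across word forms. The abstract does not specify the distribution format, so the tab-separated layout, the column names (`surface`, `lemma`, `pos`), and all sample values are assumptions made purely for illustration. They are placeholders, not actual QuranMorph entries or SAMA/Qabas tags.

```python
import csv
import io
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token: surface form, lemma, and POS tag."""
    surface: str
    lemma: str
    pos: str


def read_tokens(tsv_text: str) -> list[Token]:
    """Parse tab-separated rows with the assumed header surface/lemma/pos."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [Token(row["surface"], row["lemma"], row["pos"]) for row in reader]


def pos_counts(tokens: list[Token]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(token.pos for token in tokens)


def surfaces_by_lemma(tokens: list[Token]) -> dict[str, set[str]]:
    """Group surface forms under their lemma, the key used for linking."""
    index: dict[str, set[str]] = defaultdict(set)
    for token in tokens:
        index[token.lemma].add(token.surface)
    return dict(index)


# Hypothetical placeholder rows; not real corpus data or real tag names.
SAMPLE = (
    "surface\tlemma\tpos\n"
    "form_1\tlemma_a\tTAG_X\n"
    "form_2\tlemma_a\tTAG_X\n"
    "form_3\tlemma_b\tTAG_Y\n"
)

if __name__ == "__main__":
    tokens = read_tokens(SAMPLE)
    print(pos_counts(tokens))
    print(surfaces_by_lemma(tokens))
```

Because every token carries a lemma, any other resource keyed on the same lemma inventory could in principle be joined against an index like the one built by `surfaces_by_lemma`. How QuranMorph actually stores and exposes these links is described in the paper and on the project page, not here.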
