Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition

Nagham Hamad; Mohammed Khalilia; Mustafa Jarrar

arXiv:2506.12615 · cs.CL · June 17, 2025

TL;DR

Konooz is a comprehensive multi-dialect, multi-domain Arabic corpus with extensive annotations, used to benchmark NER models and analyze cross-domain and cross-dialect performance issues in Arabic NLP.

Contribution

This paper introduces Konooz, a large, annotated multi-dialect Arabic corpus, and provides the first benchmarking of Arabic NER models across diverse domains and dialects.

Findings

01 NER model performance drops by up to 38% across domains and dialects, relative to in-distribution data

02 Significant divergence between domains and dialects, as measured by the Maximum Mean Discrepancy (MMD) metric

03 Certain models perform better on specific dialects and domains
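
To make the MMD finding concrete, here is a minimal sketch of how a squared Maximum Mean Discrepancy between two samples can be estimated. It is an illustration only: this summary does not state which kernel, text features, or estimator the authors used, so the Gaussian (RBF) kernel, the biased estimator, the `gamma` value, and the toy 2-d vectors standing in for text embeddings are all assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


if __name__ == "__main__":
    # Hypothetical toy "embeddings" for two domains; not data from Konooz.
    domain_a = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]
    domain_b = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.2]]
    print("MMD^2(a, a):", mmd_squared(domain_a, domain_a))
    print("MMD^2(a, b):", mmd_squared(domain_a, domain_b))
```

Under these assumptions, a value near zero suggests the two samples are distributed similarly, while larger values indicate greater divergence. Applied to Konooz, each sample would be a set of sentence or token representations drawn from one domain or dialect.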

Abstract

We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types using both nested and flat annotation schemes, following the Wojood guidelines. While Konooz is useful for various NLP tasks like domain adaptation and transfer learning, this paper primarily focuses on benchmarking existing Arabic Named Entity Recognition (NER) models, especially cross-domain and cross-dialect model performance. Our benchmarking of four Arabic NER models using Konooz reveals a significant drop in performance of up to 38% when compared to the in-distribution data. Furthermore, we present an in-depth analysis of domain and dialect divergence and the impact of resource scarcity. We also measured the overlap between domains and dialects…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems