TL;DR
This paper introduces the Arabic Generality Score (AGS), a new measure to quantify how widely words are used across Arabic dialects, complementing existing dialectness modeling approaches.
Contribution
The paper proposes AGS as a scalable, linguistically grounded measure of lexical generality and develops a pipeline to annotate corpora and predict AGS in context.
Findings
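A regression model trained to predict AGS in context outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark
The pipeline annotates a parallel corpus with word-level AGS by combining word alignment, etymology-aware edit distance, and smoothing
AGS complements ALDi and enriches representations of Arabic dialectness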
Abstract
Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness.
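The abstract names the pipeline's ingredients (word alignment, etymology-aware edit distance, smoothing) without giving formulas, so the following is only a minimal sketch of how a word-level generality score could be computed under stated assumptions. Plain Levenshtein distance stands in for the etymology-aware distance, the 0.34 threshold and the additive smoothing term `alpha` are hypothetical choices, and the dialect labels and transliterated words are toy data, not taken from the paper.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (a stand-in, not the paper's etymology-aware variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_near_match(a: str, b: str, threshold: float = 0.34) -> bool:
    """Treat two forms as the same lexical item if their length-normalized
    distance is at most `threshold` (a hypothetical cutoff)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= threshold


def generality(word: str, aligned: Dict[str, str], alpha: float = 0.0) -> float:
    """Share of dialects whose aligned word is a near-match of `word`,
    with optional additive smoothing (assumed form, not the paper's)."""
    matches = sum(is_near_match(word, other) for other in aligned.values())
    return (matches + alpha) / (len(aligned) + 2 * alpha)


# Toy aligned forms for one concept across four hypothetical dialects.
aligned = {
    "dialect_a": "kitab",
    "dialect_b": "kitab",
    "dialect_c": "ktab",
    "dialect_d": "daftar",
}
score = generality("kitab", aligned)
print(score)
```

In this toy setting, a word shared (up to small spelling differences) by most dialects gets a score near 1, while a word confined to one dialect gets a score near `1 / len(aligned)`. The paper additionally trains a regression model to predict AGS in context, which this sketch does not attempt.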
