Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets
Amr Keleg, Walid Magdy, Sharon Goldwater

TL;DR
This paper investigates whether the Arabic Level of Dialectness (ALDi) score predicts annotator agreement in multi-dialect Arabic datasets, and suggests that samples with high ALDi scores be routed to native speakers of each sample's dialect for better annotation quality.
Contribution
It introduces the use of ALDi scores to predict annotation difficulty and recommends routing samples with high ALDi scores to native speakers of each sample's dialect to improve dataset quality.
Findings
High ALDi scores correlate with lower annotator agreement.
Routing high-ALDi samples to native speakers of the sample's dialect is recommended to improve annotation quality.
Strong evidence of ALDi's predictive power in 11 out of 15 datasets.
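
The relation between ALDi and agreement described above can be illustrated with a minimal sketch. It assumes each sample carries an ALDi score in the range 0 to 1 and the raw labels from its annotators; the function names, the number of bins, and the 0.5 routing threshold are hypothetical choices made for illustration, not the authors' actual analysis code.

```python
from collections import defaultdict


def full_agreement_by_aldi_bin(samples, n_bins=10):
    """Return the fraction of samples with unanimous labels per ALDi bin.

    samples: iterable of (aldi_score, labels), where aldi_score is assumed
    to lie in [0, 1] and labels is the list of raw annotator labels.
    """
    totals = defaultdict(int)
    agreed = defaultdict(int)
    for aldi, labels in samples:
        # Clamp so that a score of exactly 1.0 falls into the last bin.
        bin_index = min(int(aldi * n_bins), n_bins - 1)
        totals[bin_index] += 1
        if len(set(labels)) == 1:  # every annotator chose the same label
            agreed[bin_index] += 1
    return {b: agreed[b] / totals[b] for b in sorted(totals)}


def needs_dialect_speaker(aldi, threshold=0.5):
    """Hypothetical routing rule: flag high-ALDi samples for native
    speakers of the sample's dialect. The threshold is an assumption."""
    return aldi >= threshold


# Toy data, invented purely to exercise the functions.
toy_samples = [
    (0.05, ["pos", "pos", "pos"]),
    (0.08, ["neg", "neg", "neg"]),
    (0.95, ["pos", "neg", "pos"]),
    (1.00, ["neg", "neg", "neg"]),
]
agreement = full_agreement_by_aldi_bin(toy_samples)
```

If agreement falls as the bin index rises, that pattern is consistent with the hypothesis that more dialectal samples are harder to label when assigned at random.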
Abstract
On annotating multi-dialect Arabic datasets, it is common to randomly assign the samples across a pool of native Arabic speakers. Recent analyses recommended routing dialectal samples to native speakers of their respective dialects to build higher-quality datasets. However, automatically identifying the dialect of samples is hard. Moreover, the pool of annotators who are native speakers of specific Arabic dialects might be scarce. Arabic Level of Dialectness (ALDi) was recently introduced as a quantitative variable that measures how sentences diverge from Standard Arabic. On randomly assigning samples to annotators, we hypothesize that samples of higher ALDi scores are harder to label especially if they are written in dialects that the annotators do not speak. We test this by analyzing the relation between ALDi scores and the annotators' agreement, on 15 public datasets having raw…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
