Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets

Amr Keleg; Walid Magdy; Sharon Goldwater

arXiv:2405.11282·cs.CL·June 10, 2024

TL;DR

This paper investigates how the Arabic Level of Dialectness (ALDi) score can predict annotator agreement in multi-dialect Arabic datasets, suggesting that high ALDi samples should be routed to native speakers for better annotation quality.

Contribution

It proposes using ALDi scores as a predictor of annotation difficulty and, based on the observed relation between ALDi and annotator agreement, recommends routing high-ALDi samples to native speakers of the relevant dialect to improve dataset quality.

Findings

01. High ALDi scores correlate with lower annotator agreement.

02. The authors recommend routing high-ALDi samples to native speakers of the sample's dialect to improve annotation quality.

03. Strong evidence of ALDi's predictive power in 11 out of 15 datasets.

Abstract

On annotating multi-dialect Arabic datasets, it is common to randomly assign the samples across a pool of native Arabic speakers. Recent analyses recommended routing dialectal samples to native speakers of their respective dialects to build higher-quality datasets. However, automatically identifying the dialect of samples is hard. Moreover, the pool of annotators who are native speakers of specific Arabic dialects might be scarce. Arabic Level of Dialectness (ALDi) was recently introduced as a quantitative variable that measures how sentences diverge from Standard Arabic. On randomly assigning samples to annotators, we hypothesize that samples of higher ALDi scores are harder to label especially if they are written in dialects that the annotators do not speak. We test this by analyzing the relation between ALDi scores and the annotators' agreement, on 15 public datasets having raw…
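
The analysis described in the abstract can be illustrated with a minimal sketch. It assumes each sample carries an ALDi score in [0, 1] and a list of raw per-annotator labels, and it uses the share of samples with full annotator agreement per ALDi bin as the agreement measure. The function name `agreement_by_aldi_bin`, the choice of four equal-width bins, and the toy data are illustrative assumptions, not the authors' code or data.

```python
from collections import defaultdict


def agreement_by_aldi_bin(samples, n_bins=4):
    """Return the fraction of fully agreed samples in each ALDi bin.

    samples: iterable of (aldi_score, labels) pairs, where aldi_score is
    assumed to lie in [0, 1] and labels is the list of raw annotator labels.
    Bins are equal-width; empty bins are omitted from the result.
    """
    totals = defaultdict(int)
    agreed = defaultdict(int)
    for aldi, labels in samples:
        # Clamp the index so a score of exactly 1.0 lands in the last bin.
        bin_index = min(int(aldi * n_bins), n_bins - 1)
        totals[bin_index] += 1
        # Full agreement: every annotator chose the same label.
        if len(set(labels)) == 1:
            agreed[bin_index] += 1
    return {b: agreed[b] / totals[b] for b in sorted(totals)}


# Hypothetical toy data: (ALDi score, labels from three annotators).
toy_samples = [
    (0.05, ["pos", "pos", "pos"]),
    (0.10, ["neg", "neg", "neg"]),
    (0.40, ["pos", "pos", "neg"]),
    (0.45, ["neg", "neg", "neg"]),
    (0.80, ["pos", "neg", "neg"]),
    (0.95, ["pos", "neg", "pos"]),
]

rates = agreement_by_aldi_bin(toy_samples)
```

On this toy data the full-agreement rate falls as the ALDi bin rises, which is the pattern the hypothesis predicts. Real datasets would need many samples per bin before such a trend could be read as evidence.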

Code & Models

Repositories

amr-keleg/aldi-and-iaa (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling