Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov

TL;DR
This paper introduces a multi-label Arabic Dialect Identification (ADI) approach that combines curriculum learning with automatic annotations produced by GPT-4o and binary dialect acceptability classifiers, raising macro F1 on the MLADI leaderboard to 0.69 from the 0.55 of previous systems.
Contribution
It constructs a large-scale multi-label dataset from GPT-4o annotations and binary dialect acceptability classifiers, and applies curriculum learning to improve dialect identification.
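As a rough illustration of what curriculum learning means here, the sketch below orders training examples from easy to hard and releases them in cumulative stages. It is a minimal sketch, not the authors' training recipe: the difficulty signal (number of dialect labels per sentence), the dialect codes, the placeholder sentences, and the names `curriculum_stages`, `by_cardinality`, and `data` are all assumptions made for the example.

```python
def curriculum_stages(examples, difficulty, n_stages):
    """Yield cumulative training subsets, easiest examples first."""
    ordered = sorted(examples, key=difficulty)  # stable sort keeps ties in input order
    for stage in range(1, n_stages + 1):
        # Ceiling division so the final stage always covers every example.
        cutoff = (len(ordered) * stage + n_stages - 1) // n_stages
        yield ordered[:cutoff]


# Hypothetical data: (placeholder text, set of acceptable dialects).
data = [
    ("sentence A", {"EGY"}),
    ("sentence B", {"EGY", "LEV"}),
    ("sentence C", {"LEV"}),
    ("sentence D", {"EGY", "LEV", "GLF"}),
]


def by_cardinality(example):
    """Assumed difficulty proxy: more acceptable dialects means harder."""
    return len(example[1])


for i, subset in enumerate(curriculum_stages(data, by_cardinality, 2), start=1):
    print(f"stage {i}: {[text for text, _ in subset]}")
```

Each stage would be passed to the usual fine-tuning loop in turn, so the model sees the single-dialect sentences before the ones shared by several dialects.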
Findings
Achieved a macro F1 of 0.69 on the MLADI leaderboard
Outperformed previous systems, which reached a macro F1 of 0.55
Demonstrated effectiveness of curriculum learning in dialect classification
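For readers unfamiliar with the metric, macro F1 in a multi-label setting is commonly computed by scoring each label as its own binary task and averaging the per-label F1 values. The sketch below shows that computation on toy multi-hot vectors. The three-dialect label order and the example predictions are assumptions, and the leaderboard's exact scoring script may differ in details such as how it handles labels with no positives.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-label binary F1 scores."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a label with no positives and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy example with an assumed label order [EGY, LEV, GLF].
y_true = [[1, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0]]
print(round(macro_f1(y_true, y_pred), 4))  # per-label F1 is 1.0, 0.5, 1.0
```

Because every dialect contributes equally to the average, a system cannot reach a high macro F1 by doing well only on the most frequent dialects.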
Abstract
Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, but recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based…
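The negative-sample problem described in the abstract can be made concrete with a small numeric sketch. Suppose a sentence is acceptable in two dialects but the single-label dataset records only one. Turning that label into a multi-hot target marks the second dialect as a negative, so a model that correctly assigns it a high probability is penalized. The code below contrasts that naive conversion with one that leaves the unlabeled dialects as unknown and skips them in the loss. The masking variant is only an illustration of the issue, not the paper's method, which instead builds pseudo-labels with GPT-4o and acceptability classifiers. The dialect codes, probabilities, and function names are assumptions.

```python
import math

DIALECTS = ["EGY", "LEV", "GLF"]  # hypothetical label set


def naive_targets(gold):
    """Single-label to multi-hot: every other dialect becomes a negative."""
    return [1 if d == gold else 0 for d in DIALECTS]


def masked_targets(gold):
    """Keep the gold dialect as positive; mark the others as unknown (None)."""
    return [1 if d == gold else None for d in DIALECTS]


def bce_loss(probs, targets):
    """Mean binary cross-entropy over the labels whose target is known."""
    terms = [
        -math.log(p) if t == 1 else -math.log(1.0 - p)
        for p, t in zip(probs, targets)
        if t is not None
    ]
    return sum(terms) / len(terms)


# A sentence valid in EGY and LEV, recorded only as EGY in the single-label data.
probs = [0.9, 0.8, 0.1]  # assumed model probabilities for EGY, LEV, GLF
print("naive :", round(bce_loss(probs, naive_targets("EGY")), 4))
print("masked:", round(bce_loss(probs, masked_targets("EGY")), 4))
```

The naive loss is dominated by the LEV term even though the prediction is arguably correct, which is the kind of false negative that motivates building multi-label annotations rather than reusing single-label data as is.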
Taxonomy
Topics: Authorship Attribution and Profiling · Text and Document Classification Technologies · Natural Language Processing Techniques
