Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Ali Mekky; Mohamed El Zeftawy; Lara Hassan; Amr Keleg; Preslav Nakov

arXiv:2602.12937 · cs.CL · February 18, 2026

TL;DR

This paper introduces a multi-label Arabic Dialect Identification approach that combines curriculum learning with automatic annotations from GPT-4o and binary dialect acceptability classifiers, reaching a macro F1 of 0.69 on the MLADI leaderboard against 0.55 for previous systems.

Contribution

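The paper constructs a large-scale multi-label dataset from automatic annotations produced by GPT-4o and binary dialect acceptability classifiers, aggregated under the guidance of the Arabic Level of Dialectness (ALDi), and applies curriculum learning to improve dialect identification.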

Findings

01

Achieved a macro F1 of 0.69 on the MLADI leaderboard

02

Outperformed previous systems, which reached a macro F1 of 0.55

03

Demonstrated effectiveness of curriculum learning in dialect classification
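
For readers unfamiliar with the metric behind these figures, the following is a minimal sketch of macro F1 in a multi-label setting: F1 is computed separately for each label (here, each dialect) and the per-label scores are averaged with equal weight. The toy labels are hypothetical, and scoring a label with no positives and no predictions as 0.0 is an assumption; the MLADI leaderboard may follow a different convention.

```python
from typing import List, Sequence


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for a single label, given 0/1 gold and predicted values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumption: a label with no positives and no predictions scores 0.0.
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Unweighted mean of per-label F1 over multi-hot rows."""
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        col_true = [row[j] for row in y_true]
        col_pred = [row[j] for row in y_pred]
        scores.append(f1_for_label(col_true, col_pred))
    return sum(scores) / num_labels


# Hypothetical toy data: 3 sentences, 3 dialect labels.
gold = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
score = macro_f1(gold, pred)  # per-label F1: 1.0, 2/3, 0.0 -> mean 5/9
```

Because every label counts equally, a dialect that the model never predicts pulls the average down as much as a frequent one would.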

Abstract

Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, but recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based…
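
The negative-sample problem described in the abstract can be illustrated with a small sketch. It is not the authors' pipeline: the dialect list, the example sentence labels, and the helper names `naive_multi_hot` and `false_negatives` are all hypothetical. It only shows how a naive conversion of a single-label example into a multi-hot target marks every other dialect as negative, even when the sentence would be acceptable there.

```python
from typing import List, Set

# Hypothetical subset of dialect labels, for illustration only.
DIALECTS = ["Egypt", "Syria", "Jordan"]


def naive_multi_hot(single_label: str, dialects: List[str]) -> List[int]:
    """Turn one gold dialect into a multi-hot target; all others become 0."""
    return [1 if d == single_label else 0 for d in dialects]


def false_negatives(single_label: str, acceptable: Set[str],
                    dialects: List[str]) -> List[str]:
    """Dialects the naive target marks negative although the sentence is acceptable."""
    target = naive_multi_hot(single_label, dialects)
    return [d for d, t in zip(dialects, target) if t == 0 and d in acceptable]


# Hypothetical sentence: annotated as "Syria" in a single-label dataset,
# but assumed to be acceptable in both Syria and Jordan.
target = naive_multi_hot("Syria", DIALECTS)
wrongly_negative = false_negatives("Syria", {"Syria", "Jordan"}, DIALECTS)
```

Here `target` is `[0, 1, 0]` and `wrongly_negative` is `["Jordan"]`: a model trained on this target is penalized for predicting Jordan, which is the kind of noisy negative the abstract refers to.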

Taxonomy

Topics: Authorship Attribution and Profiling · Text and Document Classification Technologies · Natural Language Processing Techniques