Zero-Resource Multi-Dialectal Arabic Natural Language Understanding
Muhammad Khalifa, Hesham Hassan, Aly Fahmy

TL;DR
This paper explores zero-resource dialectal Arabic understanding by using self-training on unlabeled dialect data to improve performance on NER, POS tagging, and sarcasm detection, achieving state-of-the-art results.
Contribution
It introduces a self-training approach with unlabeled dialectal Arabic data to enhance zero-shot transfer from standard Arabic, addressing resource scarcity.
Findings
Self-training improves zero-shot NER F1 by up to 10%
Raises zero-shot POS tagging accuracy by up to 2%
Raises zero-shot sarcasm detection F1 by up to 4.5%
Abstract
A reasonable amount of annotated data is required for fine-tuning pre-trained language models (PLMs) on downstream tasks. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on Modern Standard Arabic (MSA) data only, and identify a significant performance drop when such models are evaluated on DA. To remedy this drop, we propose self-training with unlabeled DA data and apply it to named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data: it improves zero-shot MSA-to-DA transfer by as much as 10% F1 (NER), 2% accuracy (POS tagging), and 4.5% F1 (SRD). We conduct an ablation…
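The self-training idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a toy nearest-centroid classifier stands in for the fine-tuned PLM, and the confidence threshold, the number of rounds, and all names (`train_centroids`, `predict_proba`, `self_train`, `msa_labeled`, `da_unlabeled`) are assumptions made for illustration. The excerpt above does not state how pseudo-labels are selected, so the thresholding rule here is one common choice rather than the paper's.

```python
import math


def train_centroids(examples):
    """Fit a toy nearest-centroid classifier: one mean vector per label.

    Stand-in for fine-tuning a PLM (assumption for illustration only).
    """
    sums, counts = {}, {}
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_proba(centroids, x):
    """Softmax over negative Euclidean distances to each centroid."""
    scores = {y: -math.dist(x, c) for y, c in centroids.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Generic self-training loop (threshold and round count are assumed).

    1. Train on labeled source data (MSA in the paper's setting).
    2. Predict on unlabeled target data (DA).
    3. Move confident predictions into the training set as pseudo-labels.
    4. Retrain, and repeat until nothing new is confident.
    """
    train_set = list(labeled)
    pool = list(unlabeled)
    model = train_centroids(train_set)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            probs = predict_proba(model, x)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # no new pseudo-labels, so further rounds change nothing
        train_set.extend(confident)
        pool = remaining
        model = train_centroids(train_set)
    return model, train_set, pool


# Hypothetical toy data: labeled "MSA" points and shifted, unlabeled "DA" points.
msa_labeled = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"),
               ((4.0, 0.0), "B"), ((4.0, 1.0), "B")]
da_unlabeled = [(1.0, 3.0), (1.5, 3.5), (5.0, 3.0), (5.5, 3.5)]

model, train_set, pool = self_train(msa_labeled, da_unlabeled, threshold=0.7)
print(len(train_set) - len(msa_labeled), "pseudo-labeled;", len(pool), "left unlabeled")
```

In a real setting the classifier would be a PLM fine-tuned on MSA, and for token-level tasks such as NER and POS tagging the selection would need to operate on sentences or tokens rather than single points. The loop structure (predict, select, add, retrain) stays the same.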
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
