Zero-Resource Multi-Dialectal Arabic Natural Language Understanding

Muhammad Khalifa; Hesham Hassan; Aly Fahmy

arXiv:2104.06591 · cs.CL · May 27, 2022

Open Access

TL;DR

This paper explores zero-resource dialectal Arabic understanding by using self-training on unlabeled dialect data to improve performance on NER, POS tagging, and sarcasm detection, achieving state-of-the-art results.

Contribution

The paper proposes self-training on unlabeled Dialectal Arabic data to improve zero-shot transfer from Modern Standard Arabic, addressing the scarcity of labeled dialectal data.

Findings

01

Self-training improves NER F1 by up to ~10%

02

Self-training improves POS tagging accuracy by up to 2%

03

Self-training improves sarcasm detection F1 by up to 4.5%

Abstract

A reasonable amount of annotated data is required for fine-tuning pre-trained language models (PLMs) on downstream tasks. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on Modern Standard Arabic (MSA) data only, and identify a significant performance drop when evaluating such models on DA. To remedy this drop, we propose self-training with unlabeled DA data and apply it to named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data, improving zero-shot MSA-to-DA transfer by as much as $\sim$10\% F$_{1}$ (NER), 2\% accuracy (POS tagging), and 4.5\% F$_{1}$ (SRD). We conduct an ablation…
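The self-training idea in the abstract can be illustrated with a generic pseudo-labeling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: a toy nearest-centroid classifier stands in for the fine-tuned PLM, the 2-D feature vectors stand in for MSA and DA examples, and the names `train`, `predict`, `self_train`, `threshold`, and `max_rounds` are all hypothetical.

```python
import math
from collections import Counter


def train(examples):
    """Fit a toy nearest-centroid classifier on (features, label) pairs.

    Stand-in for fine-tuning a PLM; not the model used in the paper.
    """
    sums, counts = {}, Counter()
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    scores = {y: -math.dist(c, x) for y, c in centroids.items()}
    z = sum(math.exp(s) for s in scores.values())
    label = max(scores, key=scores.get)
    return label, math.exp(scores[label]) / z


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Generic self-training: pseudo-label confident examples, then retrain.

    `threshold` and `max_rounds` are illustrative values (assumptions).
    """
    train_set = list(labeled)
    pool = list(unlabeled)
    model = train(train_set)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            label, conf = predict(model, x)
            if conf >= threshold:
                confident.append((x, label))  # keep as a pseudo-labeled example
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from
        train_set.extend(confident)
        pool = remaining
        model = train(train_set)
    return model, pool


# Toy data: labeled "MSA" points and shifted, unlabeled "DA" points (made up).
msa_labeled = [((0.0, 0.0), "O"), ((0.2, 0.1), "O"),
               ((3.0, 3.0), "PER"), ((3.2, 2.9), "PER")]
da_unlabeled = [(1.0, 1.0), (0.8, 1.2), (4.0, 4.0), (2.0, 2.2), (4.5, 4.2)]

model, leftover = self_train(msa_labeled, da_unlabeled)
print(predict(model, (4.0, 4.0))[0], len(leftover))
```

The confidence threshold is the main design choice in such a loop: a higher value admits fewer but cleaner pseudo-labels, while a lower value adds more unlabeled data at the cost of more label noise.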

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification