The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models

Go Inoue; Bashar Alhafni; Nurpeiis Baimukan; Houda Bouamor; Nizar Habash

arXiv:2103.06678 · cs.CL · September 7, 2021 · 140 cites

Open Access · 1 Repo · 10 Models

TL;DR

This study investigates how language variant, pre-training data size, and fine-tuning task type affect Arabic pre-trained language models. Its results suggest that how close the pre-training variant is to the fine-tuning data matters more than how much pre-training data is used.

Contribution

The paper introduces Arabic pre-trained language models spanning several variants and pre-training data sizes, and shows that variant proximity matters more than data size for model performance.

Findings

01

Variant proximity to fine-tuning data is more crucial than data size.

02

Models trained on mixed variants perform competitively across tasks.

03

Optimized system selection benefits from considering language variant proximity.

Abstract

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the…
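The idea of choosing a pre-trained model by variant proximity can be illustrated with a minimal sketch. This is not the paper's system selection model, whose details are not given on this page; it is a simple lookup under stated assumptions. The checkpoint identifiers are assumed Hugging Face Hub names for the released models, and `select_checkpoint` is a hypothetical helper introduced only for illustration.

```python
# Hypothetical lookup from the Arabic variant of a fine-tuning dataset
# to a pre-trained checkpoint. The identifiers below are assumed
# Hugging Face Hub names; they are not taken from this page.
CHECKPOINTS = {
    "msa": "CAMeL-Lab/bert-base-arabic-camelbert-msa",
    "dialectal": "CAMeL-Lab/bert-base-arabic-camelbert-da",
    "classical": "CAMeL-Lab/bert-base-arabic-camelbert-ca",
    "mix": "CAMeL-Lab/bert-base-arabic-camelbert-mix",
}


def select_checkpoint(variant: str) -> str:
    """Return the checkpoint whose pre-training variant matches `variant`.

    Falls back to the mixed-variant model when the variant is unknown,
    on the assumption that a mixed model is a reasonable default.
    """
    key = variant.strip().lower()
    return CHECKPOINTS.get(key, CHECKPOINTS["mix"])


if __name__ == "__main__":
    for name in ("MSA", "dialectal", "unknown"):
        print(name, "->", select_checkpoint(name))
```

The returned string could then be passed to a model loader such as `transformers.AutoModel.from_pretrained`, provided the assumed identifiers are correct.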

Code & Models

Repositories

CAMeL-Lab/CAMeLBERT
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification