The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models
Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, Nizar Habash

TL;DR
This study investigates how language variant, data size, and task type influence Arabic pre-trained language models, revealing that variant proximity to fine-tuning data outweighs data size in importance.
Contribution
The paper introduces Arabic pre-trained language models spanning several language variants and pre-training data sizes, and shows that variant proximity matters more than data size for model performance.
Findings
Variant proximity to fine-tuning data is more crucial than data size.
Models trained on mixed variants perform competitively across tasks.
Optimized system selection benefits from considering language variant proximity.
Abstract
In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the…
Code & Models
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-ca-ner · model · 59 dl · ♡ 2
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-ca-poetry · model · 13 dl · ♡ 4
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-egy · model · 25 dl · ♡ 3
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-glf · model · 4 dl · ♡ 1
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-msa · model · 12 dl
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-ca-sentiment · model · 256 dl · ♡ 3
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-ca · model · 1.7k dl · ♡ 13
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-da-ner · model · 86 dl
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-da-poetry · model · 9 dl
- 🤗 CAMeL-Lab/bert-base-arabic-camelbert-da-pos-egy · model · 5 dl · ♡ 1
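The model IDs listed here follow a regular pattern: a shared prefix, a variant infix (`ca` or `da`, which appear to correspond to the classical and dialectal Arabic variants named in the abstract), and an optional fine-tuning task suffix. The sketch below is a minimal illustration of that pattern, not part of the authors' code. The helper name `camelbert_model_id` and the set `LISTED_MODELS` are hypothetical, and the set contains only the IDs shown in this listing, which may not be the complete collection.

```python
# IDs copied verbatim from the listing on this page; no other IDs are assumed.
LISTED_MODELS = {
    "CAMeL-Lab/bert-base-arabic-camelbert-ca",
    "CAMeL-Lab/bert-base-arabic-camelbert-ca-ner",
    "CAMeL-Lab/bert-base-arabic-camelbert-ca-poetry",
    "CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-egy",
    "CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-glf",
    "CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-msa",
    "CAMeL-Lab/bert-base-arabic-camelbert-ca-sentiment",
    "CAMeL-Lab/bert-base-arabic-camelbert-da-ner",
    "CAMeL-Lab/bert-base-arabic-camelbert-da-poetry",
    "CAMeL-Lab/bert-base-arabic-camelbert-da-pos-egy",
}

# Shared prefix observed in every listed ID.
PREFIX = "CAMeL-Lab/bert-base-arabic-camelbert"


def camelbert_model_id(variant, task=None):
    """Build a model ID from a variant infix and an optional task suffix.

    Hypothetical helper: it only accepts combinations that appear in
    LISTED_MODELS and raises ValueError for anything else.
    """
    parts = [PREFIX, variant]
    if task:
        parts.append(task)
    model_id = "-".join(parts)
    if model_id not in LISTED_MODELS:
        raise ValueError(f"{model_id!r} is not in the listing on this page")
    return model_id
```

A returned ID is an ordinary Hugging Face Hub identifier, so it can be passed to the `from_pretrained` method of the `transformers` Auto classes. Rejecting unlisted combinations keeps the helper from silently producing an ID that may not exist on the Hub.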
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
