Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding
Abbas Ghaddar, Yimeng Wu, Sunyam Bagga, Ahmad Rashid, Khalil Bibi,, Mehdi Rezagholizadeh, Chao Xing, Yasheng Wang, Duan Xinyu, Zhefeng Wang,, Baoxing Huai, Xin Jiang, Qun Liu, Philippe Langlais

TL;DR
This paper improves Arabic pre-trained language models by enhancing pre-training methods and systematically evaluating their performance, resulting in new models that outperform existing ones on Arabic NLU and NLG benchmarks.
Contribution
It introduces three new Arabic BERT-style models and two T5-style models, with systematic evaluation showing significant performance improvements over prior models.
Findings
New models outperform existing Arabic PLMs
Achieve state-of-the-art results on Arabic NLU tasks
Demonstrate the importance of data quality and model size
Abstract
There is a growing body of work in recent years to develop pre-trained language models (PLMs) for the Arabic language. This work concerns addressing two major problems in existing Arabic PLMs which constraint progress of the Arabic NLU and NLG fields.First, existing Arabic PLMs are not well-explored and their pre-trainig can be improved significantly using a more methodical approach. Second, there is a lack of systematic and reproducible evaluation of these models in the literature. In this work, we revisit both the pre-training and evaluation of Arabic PLMs. In terms of pre-training, we explore improving Arabic LMs from three perspectives: quality of the pre-training data, size of the model, and incorporating character-level information. As a result, we release three new Arabic BERT-style models ( JABER, Char-JABER, and SABER), and two T5-style models (AT5S and AT5B). In terms of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
