A Survey of Large Language Models for Arabic Language and its Dialects
Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

TL;DR
This survey reviews Large Language Models for the Arabic language, covering their architectures, datasets, performance, openness, challenges, and future research directions.
Contribution
It provides the first extensive overview of Arabic LLMs, analyzing architectures, datasets, performance, openness, and future challenges in a unified framework.
Findings
Arabic LLMs vary widely in architecture and dataset sources.
Openness of Arabic LLMs is limited by data and code availability.
Future research needs more diverse dialectal datasets and greater transparency.
Abstract
This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
