AlcLaM: Arabic Dialectal Language Model
Murtadha Ahmed, Saghir Alfasly, Bo Wen, Jamaal Qasem, Mohammed Ahmed, Yunfeng Liu

TL;DR
AlcLaM is a new Arabic dialectal language model trained on a specialized social media corpus, achieving superior NLP task performance with significantly less data than existing models.
Contribution
This paper introduces AlcLaM, an Arabic dialectal language model trained from scratch on a social media corpus, addressing limitations of standard Arabic models on dialects.
Findings
AlcLaM outperforms existing models on various NLP tasks.
It requires only 13 GB of training text, which is 7.8%, 10.2%, and 21.3% of the data used by CAMeL, MARBERT, and ArBERT, respectively.
It demonstrates the effectiveness of dialect-specific training data.
Abstract
Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often grapple with challenges like high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on Modern Standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13 GB of text, a fraction of the data used by existing models: 7.8%, 10.2%, and 21.3% of the data used by CAMeL, MARBERT, and ArBERT, respectively. Remarkably, AlcLaM demonstrates superior performance on a…
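The percentages in the abstract imply approximate training-data sizes for the other models. The sketch below works through that arithmetic, assuming each percentage is the ratio of AlcLaM's 13 GB to the other model's training data. The implied sizes are derived here for illustration; they are not stated in the text above, and the function name is hypothetical.

```python
# Figures taken from the abstract: AlcLaM's training-text size and the
# percentage of each other model's data that this size represents.
ALCLAM_GB = 13.0
RATIO_PERCENT = {"CAMeL": 7.8, "MARBERT": 10.2, "ArBERT": 21.3}


def implied_corpus_gb(alclam_gb: float, percent: float) -> float:
    """Return the corpus size of which `alclam_gb` is `percent` percent.

    Assumption: the abstract's percentages mean
    alclam_gb / other_corpus_gb * 100.
    """
    return alclam_gb / (percent / 100.0)


# Derived, approximate sizes (not reported in the summary itself).
implied = {
    name: implied_corpus_gb(ALCLAM_GB, percent)
    for name, percent in RATIO_PERCENT.items()
}

for name, size_gb in implied.items():
    print(f"{name}: about {size_gb:.0f} GB of training text (implied)")
```

Under this reading, the other models were trained on roughly five to thirteen times as much text as AlcLaM.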
Taxonomy
Topics: Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques · Arabic Language Education Studies
