AlcLaM: Arabic Dialectal Language Model

Murtadha Ahmed; Saghir Alfasly; Bo Wen; Jamaal Qasem; Mohammed Ahmed; Yunfeng Liu

arXiv:2407.13097·cs.CL·July 19, 2024

TL;DR

AlcLaM is a new Arabic dialectal language model trained on a specialized social media corpus, achieving superior NLP task performance with significantly less data than existing models.

Contribution

This paper introduces AlcLaM, an Arabic dialectal language model trained from scratch on a social media corpus, addressing limitations of standard Arabic models on dialects.

Findings

01. AlcLaM outperforms existing models on various NLP tasks.
02. It was trained on only 13 GB of text, a fraction (7.8%, 10.2%, and 21.3%) of the data used by CAMeL, MARBERT, and ArBERT, respectively.
03. The results demonstrate the effectiveness of dialect-specific training data.

Abstract

Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often suffer from high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on Modern Standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We use this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Our model, named AlcLaM, was trained on only 13 GB of text, which is 7.8%, 10.2%, and 21.3% of the data used by the existing models CAMeL, MARBERT, and ArBERT, respectively. Remarkably, AlcLaM demonstrates superior performance on a…
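
The percentages quoted in the abstract can be turned back into approximate corpus sizes for the three baseline models. The sketch below is only a back-of-the-envelope check: it assumes each percentage means "13 GB divided by that model's training-data size", and the function name `implied_corpus_gb` is a hypothetical helper, not something from the paper or its repository. The resulting sizes are derived here, not reported on this page.

```python
# Back-of-the-envelope check of the data-size ratios quoted in the abstract.
# Assumption: each percentage is AlcLaM's 13 GB as a share of the baseline's data.

ALCLAM_GB = 13.0  # training text size stated in the abstract

# Percentages from the abstract, keyed by baseline model name.
share_of_baseline = {"CAMeL": 7.8, "MARBERT": 10.2, "ArBERT": 21.3}


def implied_corpus_gb(alclam_gb: float, share_percent: float) -> float:
    """Return the corpus size implied by 'alclam_gb is share_percent % of it'."""
    return alclam_gb / (share_percent / 100.0)


# Derived (not reported) corpus sizes for each baseline, in GB.
implied = {
    name: implied_corpus_gb(ALCLAM_GB, pct)
    for name, pct in share_of_baseline.items()
}

for name, gb in implied.items():
    print(f"{name}: roughly {gb:.0f} GB implied")
```

Under that assumption, the baselines would have been trained on roughly 167 GB, 127 GB, and 61 GB of text, which makes the scale gap behind the "fraction of the data" claim concrete.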

Code & Models

Repositories

amurtadha/alclam
PyTorch · Official

Taxonomy

Topics: Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques · Arabic Language Education Studies