A Large and Diverse Arabic Corpus for Language Modeling
Abbas Raza Ali, Muhammad Ajmal Siddiqui, Rema Algunaibet, Hasan Raza Ali

TL;DR
This paper introduces the largest clean and diverse Arabic corpus to enhance language modeling, resulting in significant performance improvements on NLP tasks compared to multilingual BERT.
Contribution
It presents a large, high-quality Arabic corpus and demonstrates its effectiveness in training a language model that outperforms existing multilingual models.
Findings
The Arabic LM shows a 4.5 to 8.5% improvement on NLP tasks compared to multilingual BERT.
The corpus is over 500 GB of cleaned Arabic text.
The authors describe it as the largest clean and diverse Arabic corpus to date.
Abstract
Language models (LMs) have introduced a major paradigm shift in Natural Language Processing (NLP) modeling, where large pre-trained LMs have become integral to most NLP tasks. These LMs can find useful and relevant representations of the language without any supervision, and they can be fine-tuned on typical NLP tasks with significantly higher accuracy than traditional approaches. However, training these models requires a massive corpus that is a good representation of the language. English LMs generally perform better than their counterparts in other languages because massive English corpora are available. This work elaborates on the design and development of a large Arabic corpus. It consists of over 500 GB of cleaned Arabic text targeted at improving cross-domain knowledge and downstream generalization capability of…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Dropout · Dense Connections · Weight Decay · Linear Warmup With Linear Decay · Softmax
