101 Billion Arabic Words Dataset
Manel Aloui, Hasna Chouikhi, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi

TL;DR
This paper introduces the 101 Billion Arabic Words Dataset, the largest Arabic language dataset to date, created through large-scale web data mining and cleaning, aiming to improve the authenticity and quality of Arabic language models.
Contribution
It presents a large-scale, rigorously cleaned Arabic dataset from web sources, addressing data scarcity and bias issues in Arabic language modeling.
Findings
Largest Arabic dataset available to date
Improved potential for authentic Arabic LLMs
Framework for future Arabic NLP research
Abstract
In recent years, Large Language Models have revolutionized the field of natural language processing, showcasing an impressive rise predominantly in English-centric domains. These advancements have set a global benchmark, inspiring significant efforts toward developing Arabic LLMs capable of understanding and generating the Arabic language with remarkable accuracy. Despite these advancements, a critical challenge persists: the potential bias in Arabic LLMs, primarily attributed to their reliance on datasets comprising English data that has been translated into Arabic. This reliance not only compromises the authenticity of the generated content but also reflects a broader issue: the scarcity of original quality Arabic linguistic data. This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models that are true to both the…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
