Critical Survey of the Freely Available Arabic Corpora
Wajdi Zaghouani

TL;DR
This paper surveys and compiles a comprehensive, updated list of 66 freely available Arabic corpora and resources, facilitating NLP research for Arabic language processing.
Contribution
It provides the first extensive, categorized, and accessible compilation of free Arabic corpora with direct links, addressing a significant resource gap.
Findings
Identified 66 free Arabic corpora sources
Categorized resources for easier access
Provided direct download links where available
Abstract
The availability of corpora is a major factor in building natural language processing applications. However, the costs of acquiring corpora can prevent some researchers from going further in their endeavours. Ease of access to freely available corpora is urgently needed in the NLP research community, especially for languages such as Arabic. Currently, there is no easy way to access a comprehensive and updated list of freely available Arabic corpora. In this paper, we present the results of a recent survey conducted to identify the freely available Arabic corpora and language resources. Our preliminary results yielded an initial list of 66 sources. We present our findings across the various categories studied and provide direct links to the data when possible.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
