1.5 billion words Arabic Corpus
Ibrahim Abu El-khair

TL;DR
This paper presents the creation of a large, contemporary Arabic language corpus of over 1.5 billion words, drawn from newspaper articles in ten news sources across eight countries and provided in two encodings (UTF-8, Windows CP-1256) and two markup languages (SGML, XML) for linguistic research.
Contribution
It introduces a substantial, multi-source Arabic corpus with diverse encoding and markup formats, facilitating advanced linguistic and computational studies.
Findings
Corpus contains over 1.5 billion words in total, of which about 3 million are unique.
Data collected from ten major news sources over fourteen years.
Provided in two encodings (UTF-8, Windows CP-1256) and two markup languages (SGML, XML) for broad usability.
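The difference between the total word count and the unique word count can be illustrated with a minimal sketch. It assumes plain whitespace tokenization, which is an assumption made here for illustration; the summary does not state how the corpus was actually tokenized. The function name and the sample sentences are hypothetical.

```python
from collections import Counter


def count_tokens_and_types(texts):
    """Return (total words, unique words) for an iterable of strings.

    Assumption: words are separated by whitespace. The paper's own
    tokenization rules are not given in this summary.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    total_words = sum(counts.values())
    unique_words = len(counts)
    return total_words, unique_words


# Two hypothetical sample sentences that share two words.
sample = ["ذهب الولد إلى المدرسة", "عاد الولد من المدرسة"]
total, unique = count_tokens_and_types(sample)
print(total, unique)
```

On a corpus of this size the same idea would normally be applied by streaming the files one at a time rather than holding all articles in memory.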
Abstract
This study is an attempt to build a contemporary linguistic corpus for the Arabic language. The resulting text corpus includes more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arab countries over a period of fourteen years. The corpus was encoded in two encodings, UTF-8 and Windows CP-1256, and marked up in two markup languages, SGML and XML.
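As a rough illustration of what the two encodings mean in practice, the sketch below encodes a short Arabic string in both and converts between them using Python's standard codecs (`utf-8` and `cp1256`). The sample string and the `transcode` helper are hypothetical and not part of the corpus tooling.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the src encoding and re-encode them in dst."""
    return data.decode(src).encode(dst)


# Hypothetical sample text: 12 Arabic letters and one space.
text = "اللغة العربية"

utf8_bytes = text.encode("utf-8")      # Arabic letters take 2 bytes each
cp1256_bytes = text.encode("cp1256")   # single-byte Windows Arabic code page

print(len(utf8_bytes), len(cp1256_bytes))

# Converting a CP-1256 file's bytes to UTF-8 and back is lossless
# as long as every character exists in CP-1256.
assert transcode(cp1256_bytes, "cp1256", "utf-8") == utf8_bytes
assert transcode(utf8_bytes, "utf-8", "cp1256") == cp1256_bytes
```

CP-1256 is roughly half the size of UTF-8 for Arabic text but cannot represent characters outside its code page, which is one reason a corpus might be offered in both.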
Taxonomy
Topics: Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques · Historical and Linguistic Studies
