1.5 billion words Arabic Corpus

Ibrahim Abu El-khair

arXiv:1611.04033 · cs.CL · November 15, 2016 · 57 cites

TL;DR

This paper presents the creation of a large, contemporary Arabic language corpus comprising over 1.5 billion words from newspaper articles across eight countries, encoded in multiple formats for linguistic research.

Contribution

It introduces a substantial, multi-source Arabic corpus with diverse encoding and markup formats, facilitating advanced linguistic and computational studies.

Findings

01 Corpus contains over 1.5 billion words and 3 million unique words.

02 Data collected from ten major news sources over fourteen years.

03 Includes multiple encoding and markup formats for broad usability.

Abstract

This study is an attempt to build a contemporary linguistic corpus for the Arabic language. The resulting text corpus includes more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arab countries, over a period of fourteen years. The corpus was encoded in two encodings, UTF-8 and Windows CP-1256, and marked up in two markup languages, SGML and XML.
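
To make the two encodings and the total/unique word counts concrete, here is a minimal Python sketch. It is not the author's tooling: the function names, the sample sentence, and the whitespace tokenization are all assumptions made for illustration, and the paper's own counting rules may differ.

```python
from typing import Tuple


def cp1256_to_utf8(raw: bytes) -> bytes:
    """Re-encode Windows CP-1256 bytes as UTF-8 (hypothetical helper)."""
    return raw.decode("cp1256").encode("utf-8")


def word_stats(text: str) -> Tuple[int, int]:
    """Return (total tokens, unique tokens).

    Assumption: tokens are whitespace-separated strings, with no
    normalization or stemming applied.
    """
    tokens = text.split()
    return len(tokens), len(set(tokens))


# Illustrative sentence, not taken from the corpus.
sample = "اللغة العربية لغة جميلة واللغة العربية غنية"

as_cp1256 = sample.encode("cp1256")    # single-byte Arabic code page
as_utf8 = cp1256_to_utf8(as_cp1256)    # same text, re-encoded as UTF-8

total, unique = word_stats(sample)
print(len(as_cp1256), len(as_utf8), total, unique)
```

CP-1256 stores each Arabic letter in one byte, while UTF-8 uses two, so the same text is larger in UTF-8 but is not tied to a single code page.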

Code & Models

Models

laithAzzam/Belle (model)

Taxonomy

Topics: Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques · Historical and Linguistic Studies