Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

Yonatan Belinkov; Alexander Magidow; Alberto Barrón-Cedeño; Avi Shmidman; Maxim Romanov

arXiv:1809.03891·cs.CL·September 12, 2018

TL;DR

This paper introduces a large-scale, 1400-year historical corpus of written Arabic, utilizing NLP tools and novel algorithms to analyze language evolution and refine periodization of Arabic language history.

Contribution

It presents a large-scale historical corpus of written Arabic spanning 1400 years, together with a novel automatic periodization algorithm for analyzing how the language developed over that span.

Findings

01 Confirmed division into Modern Standard and Classical Arabic

02 Validated existing periodizations of Arabic history

03 Suggested further subdivisions within Arabic language development

Abstract

Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Code & Models

Repositories

boknilev/periodization (official)
