Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations
Mustafa Jarrar, Fadi A Zaraket, Tymaa Hammouda, Daanish Masood Alavi, Martin Waahlisch

TL;DR
This paper introduces four morphologically annotated Arabic dialect corpora from Yemen, Iraq, Libya, and Sudan, totaling about 1.2 million tokens, together with an open-source annotation toolkit for linguistic research.
Contribution
It provides the first large-scale, morphologically-annotated dialect corpora for Yemeni, Iraqi, Libyan, and Sudanese Arabic, along with a dedicated annotation toolkit.
Findings
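The corpora contain about 1.2 million tokens across four dialects: roughly 1.05M for Yemeni and about 50K each for Iraqi, Libyan, and Sudanese.
Each word is segmented into prefixes, stem, and suffixes and annotated with morphological features such as part of speech, lemma, and an English gloss.
ADAT (Arabic Dialect Annotation Toolkit) is open source and supports consistent annotation.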
Abstract
This article presents Lisan, a set of morphologically annotated corpora of the Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects. Lisan comprises around 1.2 million tokens. We collected the content of the corpora from several social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (~50K tokens each) were collected manually from Facebook and YouTube posts and comments. Thirty-five (35) annotators who are native speakers of the target dialects carried out the annotations. The annotators segmented all words in the four corpora into prefixes, stems, and suffixes and labeled each with morphological features such as part of speech, lemma, and an English gloss. An Arabic Dialect Annotation Toolkit (ADAT) was developed for the purpose of the annotation. The annotators were trained on a set of guidelines and on how to use…
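To make the annotation scheme described in the abstract concrete, here is a minimal Python sketch of what one annotated word might look like. The class name `AnnotatedToken`, its field names, the `NOUN` tag, and the example word are all hypothetical illustrations; they are not taken from the Lisan release or from ADAT, whose actual file format and tag set are not described in this summary. The consistency check also assumes that the segments concatenate back to the surface form, which real annotations need not satisfy when spelling changes at morpheme boundaries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotatedToken:
    """Hypothetical record for one morphologically annotated word."""
    surface: str                                        # word as it appears in the text
    prefixes: List[str] = field(default_factory=list)   # zero or more prefix segments
    stem: str = ""                                      # the stem segment
    suffixes: List[str] = field(default_factory=list)   # zero or more suffix segments
    pos: str = ""                                       # part-of-speech label (tag set assumed)
    lemma: str = ""                                     # dictionary form
    gloss: str = ""                                     # English gloss

    def segments(self) -> List[str]:
        """Return all segments in reading order: prefixes, stem, suffixes."""
        return [*self.prefixes, self.stem, *self.suffixes]

    def is_consistent(self) -> bool:
        """Check that the segments concatenate to the surface form.

        This is a simplifying assumption made for the sketch only.
        """
        return "".join(self.segments()) == self.surface


# Illustrative example (not drawn from the corpora): "and their book".
token = AnnotatedToken(
    surface="وكتابهم",
    prefixes=["و"],
    stem="كتاب",
    suffixes=["هم"],
    pos="NOUN",
    lemma="كتاب",
    gloss="book",
)

print(token.segments())
print(token.is_consistent())
```

A record like this keeps the segmentation and the word-level labels together, so a simple check such as `is_consistent()` can flag segmentations that dropped or duplicated characters.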
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Authorship Attribution and Profiling
