Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry
Mo El-Haj

TL;DR
The Tarab Corpus is, to the authors' knowledge, the largest open Arabic corpus of song lyrics and poetry, covering multiple dialects, eras, and styles and enabling diverse linguistic and cultural analyses.
Contribution
This paper introduces the Tarab Corpus, a comprehensive, multi-dialect Arabic linguistic resource combining classical and modern texts with detailed metadata.
Findings
Baseline analyses for dialect identification
Genre differentiation results
Dataset's potential for linguistic research
Abstract
We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic…
Taxonomy
Topics: Authorship Attribution and Profiling · Language, Linguistics, Cultural Analysis · Medieval and Classical Philosophy
