Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry

Mo El-Haj

arXiv:2603.16601·cs.CL·March 18, 2026

TL;DR

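To its author's knowledge, the Tarab Corpus is the largest open Arabic corpus of song lyrics and poetry. It covers Classical Arabic, Modern Standard Arabic, and six regional varieties across more than fourteen centuries, supporting a range of linguistic and cultural analyses.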

Contribution

This paper introduces the Tarab Corpus, a multi-dialect Arabic resource of 2.56 million verses that combines classical and contemporary song lyrics and poetry with structured verse-level metadata.

Findings

01. Baseline analyses for dialect identification

02. Genre differentiation results

03. Dataset's potential for linguistic research

Abstract

We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic…
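As a rough sanity check on the scale reported in the abstract, the arithmetic below derives an approximate mean verse length from the two headline figures. It is a minimal sketch, not part of the paper: the constant and function names are hypothetical, both inputs are the rounded values quoted above, and the token count is stated only as a lower bound ("more than 13.5 million"), so the result is an approximate floor rather than a measured statistic.

```python
# Figures as quoted in the abstract. Both are rounded, and the token
# count is a lower bound ("more than 13.5 million tokens").
VERSES = 2_560_000
TOKENS_AT_LEAST = 13_500_000

# Varieties named in the abstract: Classical Arabic, MSA, and six
# regional varieties. The list name is an illustrative assumption.
VARIETIES = [
    "Classical Arabic",
    "Modern Standard Arabic",
    "Egyptian",
    "Gulf",
    "Levantine",
    "Iraqi",
    "Sudanese",
    "Maghrebi",
]


def mean_tokens_per_verse(tokens: int, verses: int) -> float:
    """Return the average number of tokens per verse."""
    if verses <= 0:
        raise ValueError("verses must be positive")
    return tokens / verses


avg = mean_tokens_per_verse(TOKENS_AT_LEAST, VERSES)
print(f"Varieties covered: {len(VARIETIES)}")
print(f"Mean tokens per verse (approximate floor): {avg:.2f}")
```

Under these rounded inputs the average comes out at a little over five tokens per verse, which is consistent with short poetic lines and song verses; the corpus's own statistics should be preferred wherever they are available.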

Taxonomy

Topics: Authorship Attribution and Profiling · Language, Linguistics, Cultural Analysis · Medieval and Classical Philosophy