Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS
Rania Al-Sabbagh

TL;DR
Ramsa is a 41-hour Emirati Arabic speech corpus, still under development, designed to support sociolinguistic research and low-resource speech technologies. It is accompanied by zero-shot baseline evaluations for ASR and TTS.
Contribution
The paper introduces Ramsa, a large, sociolinguistically diverse Emirati Arabic speech corpus, and provides initial baseline results for ASR and TTS in a zero-shot setting.
Findings
Whisper-large-v3-turbo achieved the best ASR performance, with an average word error rate (WER) of 0.268.
MMS-TTS-Ara achieved a WER of 0.285 on the TTS task.
The baseline results point to significant remaining challenges and to directions for future research.
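The WER figures above are edit-distance ratios: the number of word substitutions, insertions, and deletions needed to turn a system's output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, together with the character-level analogue (CER). It is not the paper's evaluation script. The function names, whitespace tokenization, the absence of any Arabic text normalization, and the per-utterance averaging in `mean_wer` are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate, assuming plain whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def mean_wer(pairs):
    """Unweighted mean of per-utterance WER over (reference, hypothesis) pairs.

    Assumption: the source does not say whether averages are computed per
    utterance or pooled over all words; this sketch uses the former.
    """
    scores = [wer(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores)
```

Under this definition, a WER of 0.268 corresponds to roughly 27 word-level errors per 100 reference words. Scores depend on how text is normalized before comparison (for example, diacritics and punctuation), so values are comparable only under the same preprocessing.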
Abstract
Ramsa is a developing 41-hour speech corpus of Emirati Arabic designed to support sociolinguistic research and low-resource language technologies. It contains recordings from structured interviews with native speakers and episodes from national television shows. The corpus features 157 speakers (59 female, 98 male), spans subdialects such as Urban, Bedouin, and Mountain/Shihhi, and covers topics such as cultural heritage, agriculture and sustainability, daily life, professional trajectories, and architecture. It consists of 91 monologic and 79 dialogic recordings, varying in length and recording conditions. A 10% subset was used to evaluate commercial and open-source models for automatic speech recognition (ASR) and text-to-speech (TTS) in a zero-shot setting to establish initial baselines. Whisper-large-v3-turbo achieved the best ASR performance, with average word and character error…
Taxonomy
Topics: Speech Recognition and Synthesis · Linguistic Variation and Morphology · Phonetics and Phonology Research
