UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

Attia Nafees ul Haq; Zeyu Zhu; Jingbin Hu; ChunJiang He; Lei Xie

arXiv:2605.17846·eess.AS·May 19, 2026

TL;DR

UrduSpeech is a 156-hour Urdu speech corpus with 12-dimension paralinguistic annotations, built to address the scarcity of Urdu speech resources and to support speech technology development for the language.

Contribution

The paper introduces a 156-hour Urdu speech corpus with 12-dimension paralinguistic metadata, curated with an LLM-driven pipeline, and releases a 9-hour benchmark set (US-Benchmark) manually corrected by native annotators.

Findings

01

Human raters gave the 156-hour corpus a Mean Opinion Score (MOS) of 4.6 (std = 0.7), with an inter-rater Cohen's kappa of 0.68

02

The data curation pipeline reports a 97.6% confidence score

03

60-40 gender split across 71,792 utterances

Abstract

Despite having 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large, high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, and US-EngPk. To address Right-to-Left script constraints and frequent code-switching, we developed UrduSpeech, an LLM-driven pipeline to curate data across 12 diverse categories, including news, drama, and rare literary forms like Bait-Bazi. We also release a 9-hour US-Benchmark set, manually corrected by native annotators, to serve as a standard. Human quality assessment of the primary 156-hour corpus yielded a Mean Opinion Score (MOS) of 4.6 (std = 0.7), with inter-rater reliability confirmed by a Cohen's kappa of 0.68, validating our curation pipeline's 97.6% confidence score. The corpus maintains a 60-40 gender balance across 71,792 utterances.…
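
The abstract reports a MOS with its standard deviation and a Cohen's kappa for inter-rater reliability. As a rough illustration of how such figures are typically computed, here is a minimal Python sketch. The ratings are invented toy values, not data from the paper, and the function names are hypothetical. The abstract does not say whether the kappa was weighted or whether the standard deviation is a population or sample estimate, so the unweighted kappa and population standard deviation used here are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def mean_opinion_score(ratings):
    """Return (mean, population std) of a flat list of 1-5 ratings.

    Assumption: population std; the paper does not state which estimator it used.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    return mean(ratings), pstdev(ratings)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumption: unweighted kappa; a weighted variant may be more
    appropriate for ordinal MOS-style ratings.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy ratings for ten utterances (illustrative only, not from the corpus).
    rater_a = [5, 5, 4, 4, 5, 3, 5, 4, 5, 5]
    rater_b = [5, 4, 4, 4, 5, 3, 5, 5, 5, 5]
    mos, std = mean_opinion_score(rater_a + rater_b)
    print(f"MOS = {mos:.2f} (std = {std:.2f})")
    print(f"Cohen's kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because MOS ratings clustered near the top of the scale produce high raw agreement even between loosely aligned raters.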

Code & Models

Datasets

ASLP-lab/UrduSpeech
dataset · 5.0k downloads
