ERUPD -- English to Roman Urdu Parallel Dataset
Mohammed Furqan, Raahid Bin Khaja, Rayyan Habeeb

TL;DR
This paper introduces a large, high-quality parallel dataset for English-Roman Urdu translation, created through a hybrid approach combining synthetic and real conversational data, to support NLP tasks involving Roman Urdu.
Contribution
The paper presents a novel, extensive Roman Urdu-English parallel dataset developed using hybrid data generation and human refinement, addressing the lack of standardized resources for this language variant.
Findings
Created a dataset of 75,146 sentence pairs
Hybrid approach improves data quality and diversity
Dataset enhances NLP applications involving Roman Urdu
Abstract
Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicates language processing. We tackled this by employing a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We further refined the dataset through a human evaluation phase, addressing linguistic inconsistencies and ensuring accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
