ERUPD -- English to Roman Urdu Parallel Dataset

Mohammed Furqan; Raahid Bin Khaja; Rayyan Habeeb

arXiv:2412.17562 · cs.CL · December 24, 2024

Open Access

TL;DR

This paper introduces a large, high-quality parallel dataset for English-Roman Urdu translation, created through a hybrid approach combining synthetic and real conversational data, to support NLP tasks involving Roman Urdu.

Contribution

The paper presents a novel, extensive Roman Urdu-English parallel dataset developed using hybrid data generation and human refinement, addressing the lack of standardized resources for this language variant.

Findings

01 Created a parallel dataset of 75,146 English-Roman Urdu sentence pairs

02 Hybrid approach (synthetic plus real conversational data) improves data quality and diversity

03 Dataset supports NLP applications involving Roman Urdu

Abstract

Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicates language processing. We tackled this by employing a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We further refined the dataset through a human evaluation phase, addressing linguistic inconsistencies and ensuring accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling