UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension
Samreen Kazi (1), Shakeel Khoja (1) ((1) School of Mathematics & Computer Science, Institute of Business Administration, Karachi, Pakistan)

TL;DR
This paper introduces UQuAD1.0, a large-scale Urdu question answering dataset built semi-automatically from machine-translated SQuAD and human-generated samples, and evaluates Transformer-based models on it, with XLMRoBERTa reaching an F1 score of 0.66.
Contribution
The paper presents the first large-scale Urdu QA dataset, UQuAD1.0, and demonstrates the effectiveness of Transformer models like XLMRoBERTa for Urdu MRC tasks.
Findings
Transformer models outperform rule-based baselines.
XLMRoBERTa achieves an F1 score of 0.66 on UQuAD1.0.
UQuAD1.0 enables future research in Urdu machine reading comprehension.
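The summary above does not say how the F1 score is computed. As a minimal sketch, assuming the token-overlap F1 commonly used for SQuAD-style extractive QA, the metric for a single prediction could look like the following. The function name `token_f1` and the whitespace tokenization are assumptions for illustration, not details from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are obtained by splitting on whitespace, which is
    only a rough approximation for Urdu text.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example: the prediction contains one extra token.
    print(token_f1("شہر کراچی", "کراچی"))
```

A dataset-level score under this assumption would be the mean of `token_f1` over all questions, usually taking the maximum over the available reference answers for each question.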
Abstract
In recent years, low-resource Machine Reading Comprehension (MRC) has made significant progress, with models achieving remarkable performance on datasets in various languages. However, none of these models have been customized for the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset intended for extractive machine reading comprehension tasks, consisting of 49k question-answer pairs in question, passage, and answer format. In UQuAD1.0, 45,000 QA pairs were generated by machine translation of the original SQuAD1.0 and approximately 4,000 pairs via crowdsourcing. In this study, we used two types of MRC models: rule-based baseline and advanced…
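To make the "question, passage, and answer format" concrete, here is a minimal sketch of one extractive QA record. The class name `QAExample`, its field names, and the `answer_start` character offset are assumptions modeled on SQuAD-style data, not the actual UQuAD1.0 schema; the Urdu sample is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One hypothetical extractive QA record (field names are assumed)."""

    passage: str
    question: str
    answer: str
    answer_start: int  # character offset of the answer inside the passage

    def is_extractive(self) -> bool:
        """Return True if the answer is a non-empty span of the passage
        located exactly at answer_start."""
        end = self.answer_start + len(self.answer)
        return bool(self.answer) and self.passage[self.answer_start:end] == self.answer


# Invented sample, not taken from the dataset.
passage = "کراچی پاکستان کا سب سے بڑا شہر ہے۔"
answer = "کراچی"
example = QAExample(
    passage=passage,
    question="پاکستان کا سب سے بڑا شہر کون سا ہے؟",
    answer=answer,
    answer_start=passage.find(answer),
)
```

A check like `is_extractive` is one way to catch answers that no longer appear verbatim in the passage. Whether UQuAD1.0 was validated this way is not stated in the summary.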
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Linear Layer · Dense Connections · Residual Connection · Layer Normalization · Softmax · Weight Decay · WordPiece · Adam
