UQA: Corpus for Urdu Question Answering
Samee Arif, Sualeha Farid, Awais Athar, Agha Ali Raza

TL;DR
This paper presents UQA, a new Urdu question answering dataset created by translating SQuAD2.0 using the EATS method, and benchmarks multilingual QA models on it, demonstrating its utility for NLP research in low-resource languages.
Contribution
The paper introduces UQA, a high-quality Urdu QA dataset generated via EATS, and evaluates state-of-the-art models, advancing multilingual NLP for low-resource languages.
Findings
XLM-RoBERTa-XL achieved an F1 score of 85.99 and an exact match (EM) score of 74.56 on UQA.
EATS effectively preserves answer spans in translation.
UQA enhances cross-lingual transfer for Urdu NLP.
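The F1 score above, and the exact match (EM) metric usually reported alongside it for SQuAD-style QA, are typically computed per question and then averaged. The following is a minimal sketch of that computation, not the paper's evaluation script. The `normalize` helper is a hypothetical simplification (lower-casing and whitespace tokenization only), whereas the standard SQuAD script also strips English articles and punctuation, which may not suit Urdu text.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    # Assumption: whitespace tokenization is enough for illustration.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 only when the normalized token sequences are identical.
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # SQuAD2.0 has unanswerable questions: an empty gold answer is matched
    # only by an empty prediction.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts the tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the gold answer plus extra tokens scores 0 on EM but receives partial credit on F1, which is one reason F1 figures are usually higher than EM figures.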
Abstract
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. XLM-RoBERTa-XL achieves an F1 score of 85.99 and an exact match (EM) score of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the…
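The abstract names the three EATS steps but does not give implementation details here. The sketch below is one plausible reading, offered as an illustration only: the answer is enclosed in anchor markers, the marked context is translated, and the markers are then located in the output to recover the answer span. The marker strings `<<` and `>>`, the function names, and the `translate` callable are all hypothetical, and a real translation model may alter or drop the markers.

```python
from typing import Callable, Optional, Tuple

# Hypothetical anchor markers; the markers used in the paper are not stated here.
OPEN, CLOSE = "<<", ">>"


def enclose(context: str, answer_start: int, answer_text: str) -> str:
    # Enclose to Anchor: wrap the answer span in markers before translation.
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer span does not match the context")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def seek(translated: str) -> Optional[Tuple[str, int, str]]:
    # Seek: locate the markers, then strip them to get a clean context
    # plus the character offset and text of the translated answer.
    start = translated.find(OPEN)
    end = translated.find(CLOSE, start + len(OPEN))
    if start == -1 or end == -1:
        return None  # markers were lost in translation; skip this example
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    return clean, start, answer


def eats(context: str, answer_start: int, answer_text: str,
         translate: Callable[[str], str]) -> Optional[Tuple[str, int, str]]:
    # Translate: any text-to-text function can be plugged in here.
    return seek(translate(enclose(context, answer_start, answer_text)))
```

For a quick check without a translation service, `str.upper` can stand in for `translate`: `eats("The capital of France is Paris.", 25, "Paris", str.upper)` returns the upper-cased context with the answer offset still pointing at `PARIS`. Returning `None` when a marker is missing lets the caller drop examples whose span could not be recovered rather than guess at an alignment.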
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Dense Connections · Adafactor · Dropout · Gated Linear Unit · Attention Dropout · Residual Connection · Softmax · Byte Pair Encoding
