Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO
Umer Butt, Stalin Varanasi, Günter Neumann

TL;DR
This paper introduces the first large-scale Urdu IR dataset, created by machine-translating MS MARCO, and establishes zero-shot and fine-tuned retrieval baselines for Urdu, a low-resource language.
Contribution
It presents the first large-scale Urdu IR dataset and baseline results, applying the mMARCO multilingual IR methodology to improve retrieval for Urdu, a low-resource language.
Findings
Fine-tuned Urdu-mT5-mMARCO achieves an MRR@10 of 0.247 and a Recall@10 of 0.439
Fine-tuning yields significant improvements over the zero-shot baseline
The dataset and methods support inclusive IR for low-resource languages
Abstract
As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation. We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset. Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers. By bridging access gaps for speakers of low-resource languages, this work not only advances multilingual IR research but also emphasizes the…
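The two metrics reported in the abstract, MRR@10 and Recall@10, can be illustrated with a minimal sketch. This is not the authors' evaluation script: the function names, the dictionary-based input format, and the toy query and passage IDs are all assumptions made for illustration. The recall definition used here (fraction of a query's relevant passages found in the top k, averaged over queries) is one common convention and may differ from the one used in the paper.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k.

    A query with no relevant passage in its top k contributes 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Mean over queries of the fraction of relevant passages found in the top k.

    Queries without any labeled relevant passage are skipped.
    """
    total = 0.0
    counted = 0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        if not gold:
            continue
        total += len(gold & set(ranked[:k])) / len(gold)
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical toy data: ranked passage IDs per query and the relevance labels.
rankings = {
    "q1": ["p3", "p1", "p7"],  # first relevant passage appears at rank 2
    "q2": ["p9", "p4", "p2"],  # no relevant passage retrieved
}
relevant = {"q1": {"p1"}, "q2": {"p5"}}

mrr = mrr_at_k(rankings, relevant)
recall = recall_at_k(rankings, relevant)
print(f"MRR@10 = {mrr:.3f}, Recall@10 = {recall:.3f}")
```

On the toy data, only the first query retrieves its relevant passage (at rank 2), so both scores are averages of one non-zero and one zero contribution. The reported 0.247 and 0.439 would come from applying metrics of this kind to the translated MS MARCO queries.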
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
