Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO

Umer Butt; Stalin Varanasi; Günter Neumann

arXiv:2412.12997 · cs.CL · April 7, 2025

TL;DR

This paper introduces the first large-scale Urdu IR dataset, created by machine-translating MS MARCO, and establishes zero-shot and fine-tuned retrieval baselines for Urdu, a low-resource language.

Contribution

The paper presents the first large-scale Urdu IR dataset together with baseline results, and applies the mMARCO multilingual IR methodology to improve retrieval for Urdu, a low-resource language.

Findings

01

The fine-tuned model (Urdu-mT5-mMARCO) achieves an MRR@10 of 0.247 and a Recall@10 of 0.439

02

Fine-tuning yields significant improvements over the zero-shot baseline

03

The dataset and methods support inclusive IR for low-resource languages

Abstract

As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation. We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset. Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers. By bridging access gaps for speakers of low-resource languages, this work not only advances multilingual IR research but also emphasizes the…
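The two metrics reported in the abstract, MRR@10 and Recall@10, can be illustrated with a minimal sketch. This is not the authors' evaluation script: the function names, the toy query and passage IDs, and the choice to define Recall@k as the per-query fraction of relevant passages found in the top k (averaged over queries that have at least one relevant passage) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        rel = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Average fraction of relevant passages retrieved in the top k.

    Assumption: queries without any relevant passage are skipped.
    """
    total = 0.0
    counted = 0
    for qid, ranked in rankings.items():
        rel = relevant.get(qid, set())
        if not rel:
            continue
        total += len(rel & set(ranked[:k])) / len(rel)
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical toy data: two queries with ranked passage IDs.
rankings = {
    "q1": ["p3", "p1", "p7"],  # relevant passage p1 appears at rank 2
    "q2": ["p2", "p9"],        # relevant passage p5 is not retrieved
}
relevant = {"q1": {"p1"}, "q2": {"p5"}}

mrr = mrr_at_k(rankings, relevant, k=10)
recall = recall_at_k(rankings, relevant, k=10)
print(f"MRR@10 = {mrr:.3f}, Recall@10 = {recall:.3f}")
```

Under this definition, an MRR@10 of 0.247 roughly corresponds to the first relevant passage appearing around rank 4 on average, although the mean of reciprocal ranks is not the reciprocal of the mean rank.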

Code & Models

Repositories

UmerTariq1/Urdu_MsMarco_Translation_Retrieval
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling