CURE: Collection for Urdu Information Retrieval Evaluation and Ranking
Muntaha Iqbal, Kamran Amjad, Bilal Tahir, Muhammad Amir Mehmood

TL;DR
This paper introduces CURE, the first standardized Urdu IR evaluation collection, enabling consistent assessment of IR models and NLP techniques for Urdu, a language with unique morphological features and a large speaker base.
Contribution
The work constructs and evaluates the first standardized Urdu IR test collection, including document selection, relevance judgment, and language resources for lemmatization and query expansion.
Findings
Evaluation results show the effectiveness of lemmatization and query expansion.
Error analysis provides insights into model performance for Urdu IR.
The collection facilitates future research in Urdu information retrieval.
Abstract
Urdu is a widely spoken language with 163 million speakers worldwide. Information Retrieval (IR) for Urdu deserves special attention from the research community because of the language's rich morphology and large number of speakers. In general, IR evaluation has not been extensively explored for Urdu. The most important missing element is a standardized evaluation corpus specific to Urdu. In this research work, we propose and construct a standard test collection of Urdu documents for IR evaluation, named Collection for Urdu Retrieval Evaluation (CURE). We select 1,096 unique documents against 50 diverse queries from a large collection of 0.5 million crawled documents using two IR models. The purpose of the test collection is the evaluation of IR models, ranking algorithms, and different natural language processing techniques. Next, we perform binary…
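The abstract describes selecting documents for judgment from the output of two IR models. A common way to do this in test-collection building is pooling: take the top-ranked documents from each model per query and merge them. The sketch below illustrates that general idea only. The function name `build_pool`, the `depth` cutoff, and the toy rankings are assumptions made for illustration; the abstract does not state which models, cutoff, or procedure the authors actually used.

```python
from typing import Dict, List, Set


def build_pool(runs: List[Dict[str, List[str]]], depth: int) -> Dict[str, Set[str]]:
    """Merge the top-`depth` documents of every run into one pool per query.

    `runs` holds one dict per IR model, mapping a query id to that model's
    ranked list of document ids (best first). `depth` is a hypothetical
    cutoff, not a value taken from the paper.
    """
    pool: Dict[str, Set[str]] = {}
    for run in runs:
        for query_id, ranked_docs in run.items():
            # A set removes documents that both models retrieved.
            pool.setdefault(query_id, set()).update(ranked_docs[:depth])
    return pool


# Toy rankings from two hypothetical IR models (not real CURE data).
model_a = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
model_b = {"q1": ["d2", "d7", "d1"], "q2": ["d8", "d4", "d9"]}

pool = build_pool([model_a, model_b], depth=2)

# Unique documents across all queries: the set that assessors would judge.
unique_docs = set().union(*pool.values())
```

In a setup like this, the size of `unique_docs` plays the role of the 1,096 unique documents reported for the 50 queries, and each pooled document would then receive a relevance judgment.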
