UrduLM: A Resource-Efficient Monolingual Urdu Language Model
Syed Muhammad Ali, Hammad Sajid, Zainab Haider, Ali Muhammad Asad, Haya Fatima, Abdul Samad

TL;DR
UrduLM is a resource-efficient, monolingual Urdu language model trained on a curated 33GB corpus, achieving competitive results and providing an open baseline for Urdu NLP research.
Contribution
The paper introduces UrduLM, a monolingual Urdu language model pretrained in low-resource settings, together with a custom BPE tokenizer and a curated 33GB corpus, addressing the limitations of multilingual models on Urdu.
Findings
In few-shot evaluation, UrduLM achieves 66.6% accuracy on sentiment classification.
BLEU scores exceed 30 on grammar correction tasks.
Model and resources are openly released for research use.
Abstract
Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models and curated corpora. While multilingual models provide limited Urdu support, they suffer from poor performance, high computational costs, and cultural inaccuracies due to insufficient training data. To address these challenges, we present UrduLM, a pretrained Urdu monolingual language model trained in low-resource settings. We curate a 33GB Urdu corpus from diverse sources, develop a custom BPE tokenizer that reduces tokenization overhead by at least 20-30% compared to multilingual alternatives, and pretrain a 100M-parameter decoder-only model. In few-shot evaluations, UrduLM achieves competitive performance with multilingual models up to 30x its size, reaching 66.6% accuracy on sentiment classification and BLEU scores exceeding 30 on grammar correction tasks. The complete methodology --…
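The abstract does not state how the 20-30% tokenization overhead reduction was measured. One common way to compare tokenizers is to count the tokens each produces on the same text and report the relative reduction, alongside fertility (tokens per whitespace-separated word). The sketch below illustrates that idea only; the function names and the two toy tokenizers are hypothetical stand-ins, not the UrduLM tokenizer or any multilingual baseline.

```python
from typing import Callable, Iterable, List

# A tokenizer is modelled as any callable mapping a string to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def fertility(tokenize: Tokenizer, texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def relative_token_reduction(
    baseline: Tokenizer, candidate: Tokenizer, texts: Iterable[str]
) -> float:
    """Fraction of tokens saved by `candidate` relative to `baseline`."""
    texts = list(texts)  # materialise so both tokenizers see the same data
    base_count = sum(len(baseline(t)) for t in texts)
    cand_count = sum(len(candidate(t)) for t in texts)
    if base_count == 0:
        raise ValueError("baseline produced no tokens")
    return 1.0 - cand_count / base_count


# Hypothetical toy tokenizers, used only to exercise the functions above.
def char_tokenizer(text: str) -> List[str]:
    """Stand-in for a tokenizer that fragments text heavily."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    """Stand-in for a tokenizer that keeps whole words."""
    return text.split()


if __name__ == "__main__":
    sample = ["یہ ایک مثال ہے", "اردو زبان"]
    print("baseline fertility:", fertility(char_tokenizer, sample))
    print("candidate fertility:", fertility(word_tokenizer, sample))
    print("reduction:", relative_token_reduction(char_tokenizer, word_tokenizer, sample))
```

In practice the two callables would wrap real tokenizers (for example, a function returning the token list from a trained BPE model), and the texts would be a held-out Urdu sample rather than two short strings.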
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Natural Language Processing Techniques
