UrduLM: A Resource-Efficient Monolingual Urdu Language Model

Syed Muhammad Ali; Hammad Sajid; Zainab Haider; Ali Muhammad Asad; Haya Fatima; Abdul Samad

arXiv:2601.17664·cs.CL·January 27, 2026

Open Access

TL;DR

UrduLM is a resource-efficient, monolingual Urdu language model trained on a curated 33GB corpus, achieving competitive results and providing an open baseline for Urdu NLP research.

Contribution

The paper introduces UrduLM, a monolingual Urdu language model pretrained in a low-resource setting, together with a custom BPE tokenizer and a curated Urdu corpus, to address the limitations of multilingual models.

Findings

01. UrduLM achieves 66.6% accuracy in sentiment classification.

02. BLEU scores exceed 30 on grammar correction tasks.

03. Model and resources are openly released for research use.

Abstract

Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models and curated corpora. While multilingual models provide limited Urdu support, they suffer from poor performance, high computational costs, and cultural inaccuracies due to insufficient training data. To address these challenges, we present UrduLM, a monolingual Urdu language model pretrained in low-resource settings. We curate a 33GB Urdu corpus from diverse sources, develop a custom BPE tokenizer that reduces tokenization overhead by at least 20-30% compared to multilingual alternatives, and pretrain a 100M-parameter decoder-only model. In few-shot evaluations, UrduLM achieves performance competitive with multilingual models up to 30x its size, reaching 66.6% accuracy on sentiment classification and BLEU scores exceeding 30 on grammar correction tasks. The complete methodology --…
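The tokenization-overhead claim in the abstract can be made concrete with a small sketch. The abstract does not say how overhead was measured, so this assumes a common proxy: tokens per whitespace-separated word (often called fertility), compared between two tokenizers. The function names and the two toy tokenizers below are hypothetical stand-ins for illustration, not the UrduLM tokenizer or any multilingual baseline.

```python
from typing import Callable, List, Sequence

# A tokenizer is modeled as any callable mapping text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def fertility(tokenize: Tokenizer, texts: Sequence[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is smaller than `baseline` (0.25 means 25% fewer)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


# Hypothetical toy tokenizers, used only to exercise the metric.
def char_tokenizer(text: str) -> List[str]:
    """Worst case: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    """Best case: one token per whitespace-separated word."""
    return text.split()


if __name__ == "__main__":
    sample = ["یہ ایک مثال ہے"]  # illustrative Urdu sentence, not from the paper's corpus
    base = fertility(char_tokenizer, sample)
    cand = fertility(word_tokenizer, sample)
    print(f"baseline fertility:  {base:.2f}")
    print(f"candidate fertility: {cand:.2f}")
    print(f"relative reduction:  {relative_reduction(base, cand):.1%}")
```

To compare real tokenizers under the same assumption, pass each one's encode function as `tokenize` and evaluate both on the same held-out Urdu text, so the reduction reflects the tokenizers rather than differences in the data.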

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Natural Language Processing Techniques