Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training

Muhammad Taimoor Hassan; Jawad Ahmed; Muhammad Awais

arXiv:2601.08141·cs.CL·January 14, 2026

Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training

Muhammad Taimoor Hassan, Jawad Ahmed, Muhammad Awais

PDF

Open Access 1 Models

TL;DR

Qalb is a new large language model specifically designed for Urdu, developed through continued pre-training and fine-tuning, achieving state-of-the-art results on diverse Urdu NLP tasks and demonstrating effective adaptation of foundation models to low-resource languages.

Contribution

The paper introduces Qalb, the largest Urdu language model, created via a two-stage process of continued pre-training and supervised fine-tuning, significantly improving performance on Urdu NLP benchmarks.

Findings

01

Qalb outperforms previous models with a 90.34 score on Urdu benchmarks.

02

Continued pre-training on diverse Urdu data enhances model performance.

03

Targeted fine-tuning yields state-of-the-art results across multiple NLP tasks.

Abstract

Despite remarkable progress in large language models, Urdu-a language spoken by over 230 million people-remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language's complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text-spanning news archives, classical and contemporary literature, government documents, and social media-combined…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

🤗
enstazao/Qalb-1.0-8B-Instruct
model· 706 dl· ♡ 15
706 dl♡ 15

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques