Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training
Muhammad Taimoor Hassan, Jawad Ahmed, Muhammad Awais

TL;DR
Qalb is a large language model for Urdu, built by continued pre-training and supervised fine-tuning of a foundation model. It achieves state-of-the-art results on diverse Urdu NLP tasks and shows that foundation models can be adapted effectively to low-resource languages.
Contribution
The paper introduces Qalb, the largest Urdu language model, created via a two-stage process of continued pre-training and supervised fine-tuning, significantly improving performance on Urdu NLP benchmarks.
Findings
Qalb outperforms previous models, scoring 90.34 on Urdu benchmarks.
Continued pre-training on diverse Urdu data enhances model performance.
Targeted fine-tuning yields state-of-the-art results across multiple NLP tasks.
Abstract
Despite remarkable progress in large language models, Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language's complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text (spanning news archives, classical and contemporary literature, government documents, and social media) combined…
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques
