PARAM-1 BharatGen 2.9B Model

Kundeshwar Pundalik; Piyush Sawarkar; Nihar Sahoo; Abhishek Shinde; Prateek Chanda; Vedant Goswami; Ajay Nagpal; Atul Singh; Viraj Thakur; Vijay Dewane; Aamod Thakur; Bhargav Patel; Smita Gautam; Bhagwan Panditi; Shyam Pawar; Madhav Kotcha; Suraj Racha; Saral Sureka; Pankaj Singh; Rishi Bal; Rohit Saluja; Ganesh Ramakrishnan

arXiv:2507.13390 · cs.CL · July 21, 2025

TL;DR

PARAM-1 is a 2.9B-parameter, decoder-only, text-only language model trained from scratch on a bilingual Hindi-English corpus, with an explicit focus on India's linguistic diversity and cultural relevance in its architecture and training data.

Contribution

The paper introduces a culturally and linguistically focused Indian language model built around equitable representation, adapted tokenization, and diverse benchmarks, and presents it as a new standard for inclusive foundation models.

Findings

01 Achieves strong performance on Indian language tasks.

02 Demonstrates robustness in code-mixed and socio-linguistic contexts.

03 Provides a blueprint for equitable multilingual model design.

Abstract

Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a…
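The 25% Indic corpus allocation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the function names, the language codes "hi" and "en", and all token counts below are hypothetical, chosen only to show how a fixed share translates into per-language token budgets and into how often each language's data would need to be repeated.

```python
def allocate_token_budget(total_tokens, indic_share=0.25):
    """Split a total token budget between Hindi ("hi") and English ("en").

    indic_share=0.25 mirrors the 25% figure in the abstract; everything
    else here is an illustrative assumption.
    """
    if not 0.0 <= indic_share <= 1.0:
        raise ValueError("indic_share must be between 0 and 1")
    indic_tokens = round(total_tokens * indic_share)
    return {"hi": indic_tokens, "en": total_tokens - indic_tokens}


def repetition_factors(available_tokens, target_budget):
    """Return how many passes over each language's data meet its budget.

    A factor above 1 means the data must be repeated (upsampled);
    a factor below 1 means only part of it is used (downsampled).
    """
    return {
        lang: target_budget[lang] / available_tokens[lang]
        for lang in target_budget
    }


# Hypothetical numbers, for illustration only.
budget = allocate_token_budget(total_tokens=1000)
factors = repetition_factors({"hi": 125, "en": 1500}, budget)
print(budget)
print(factors)
```

With these made-up counts, the Hindi data would be seen twice and only half of the English data would be used. That is the usual trade-off when a lower-resource language is given a fixed share of the training mix.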

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications