BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Guduru Manoj; Neel Prabhanjan Rachamalla; Ashish Kulkarni; Gautam Rajeev; Jay Piplodiya; Arul Menezes; Shaharukh Khan; Souvik Rana; Manya Sah; Chandra Khatri; and Shubham Agarwal

arXiv:2511.10338 · cs.CL · November 18, 2025

Open Access · 1 dataset

TL;DR

This paper presents BhashaKritika, a large-scale synthetic dataset for Indic languages, and evaluates various data generation and quality control strategies to improve multilingual pretraining of language models.

Contribution

It introduces a systematic approach for generating and evaluating synthetic multilingual data for Indic languages, including a scalable quality assessment pipeline.

Findings

01

Synthetic data improves multilingual model performance.

02

Grounding generation in documents and personas enhances quality.

03

The study identifies best practices for constructing effective Indic language corpora.

Abstract

In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and…

Code & Models

Datasets

krutrim-ai-labs/BhashaKritika
dataset · 815 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification