IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages
Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar, Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun Balan G, Sparsh, Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

TL;DR
This paper introduces a comprehensive, open-source dataset and pipeline for pre-training and fine-tuning large language models in 22 Indian languages, addressing resource scarcity and promoting inclusive NLP development.
Contribution
It provides an extensive, curated dataset suite and a reproducible pipeline for Indic LLMs, including data collection, cleaning, translation, and toxicity alignment, with open-source release.
Findings
Created 251B tokens and 74.8M instruction-response pairs.
Developed a pipeline combining curated, unverified, and synthetic data.
Addressed toxicity alignment with generated prompts and responses.
Abstract
Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling
