IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Mohammed Safi Ur Rahman Khan; Priyam Mehta; Ananth Sankar; Umashankar Kumaravelan; Sumanth Doddapaneni; Suriyaprasaad B; Varun Balan G; Sparsh Jain; Anoop Kunchukuttan; Pratyush Kumar; Raj Dabre; Mitesh M. Khapra

arXiv:2403.06350 · cs.CL · December 2, 2024 · 1 citation

TL;DR

This paper introduces a comprehensive, open-source dataset and pipeline for pre-training and fine-tuning large language models in 22 Indian languages, addressing resource scarcity and promoting inclusive NLP development.

Contribution

It provides an extensive, curated dataset suite and a reproducible pipeline for Indic LLMs, including data collection, cleaning, translation, and toxicity alignment, with open-source release.

Findings

01 Built a resource suite totaling 251B tokens and 74.8M instruction-response pairs across 22 languages.

02 Developed a pipeline that combines curated, manually verified data with unverified and synthetic data.

03 Addressed toxicity alignment using generated prompts and responses.

Abstract

Despite the considerable advancements in English LLMs, progress in building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English…
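
The cleaning, flagging, and deduplication stages named in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the function names (`clean`, `flag`, `deduplicate`, `run_pipeline`), the `MIN_WORDS` threshold, and the choice of exact hash-based deduplication are all hypothetical and are not taken from the paper or its repository.

```python
import hashlib
import re
import unicodedata
from typing import List

# Hypothetical threshold for illustration; not a value from the paper.
MIN_WORDS = 5


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def flag(text: str) -> List[str]:
    """Return a list of quality flags; an empty list means no issue found."""
    flags = []
    if len(text.split()) < MIN_WORDS:
        flags.append("too_short")
    return flags


def deduplicate(docs: List[str]) -> List[str]:
    """Exact deduplication: keep the first document for each SHA-256 digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def run_pipeline(raw_docs: List[str]) -> List[str]:
    """Clean every document, drop flagged ones, then remove exact duplicates."""
    cleaned = [clean(d) for d in raw_docs]
    kept = [d for d in cleaned if not flag(d)]
    return deduplicate(kept)
```

Cleaning runs before deduplication in this sketch so that documents differing only in whitespace or Unicode composition collapse to the same hash. A production pipeline would likely add near-duplicate detection and language-specific filters, which are omitted here.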

Code & Models

Repositories

ai4bharat/indicllmsuite (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling