Krutrim LLM: A Novel Tokenization Strategy for Multilingual Indic Languages with Petabyte-Scale Data Processing

Rahul Kumar; Shubham Kakde; Divyansh Rajput; Daud Ibrahim; Rishabh Nahata; Pidathala Sowjanya; Deepak Kumar; Gautam Bhargava; Chandra Khatri

arXiv:2407.12481·cs.CL·April 2, 2025·1 cites

Open Access

TL;DR

This paper introduces Krutrim LLM, a multilingual Indic language model built on careful data collection, deduplication, and a custom-trained tokenizer that achieves a better token-to-word ratio than OpenAI's Tiktoken.

Contribution

The paper presents a new tokenization method and data processing pipeline specifically designed for multilingual Indic languages, enhancing model performance.

Findings

01. The custom Indic tokenizer outperforms OpenAI Tiktoken on token-to-word ratio.

02. Redundancy is present in about 70% of crawled Common Crawl web pages; deduplication is applied to address it.

03. High-quality, diverse data improves multilingual Indic language modeling.
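
The token-to-word ratio named in the first finding can be sketched as total tokens divided by total whitespace-separated words over a corpus, where a lower value means fewer tokens per word. The sketch below is a minimal illustration, not the paper's evaluation code: the function name `token_to_word_ratio`, the whitespace word definition, and the two stand-in tokenizers are all assumptions made here. In practice the `tokenize` argument would wrap a real tokenizer's encode call.

```python
from typing import Callable, Iterable, List, Sequence


def token_to_word_ratio(
    texts: Iterable[str],
    tokenize: Callable[[str], Sequence],
) -> float:
    """Average number of tokens per whitespace-separated word.

    Assumption: a "word" is any whitespace-delimited chunk. The paper may
    define words differently.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


# Hypothetical stand-in tokenizers, used only to exercise the metric.
def whitespace_tokenizer(text: str) -> List[str]:
    # One token per word: the best possible ratio under this definition.
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    # One token per non-space code point: a deliberately poor tokenizer.
    return [ch for ch in text if not ch.isspace()]


sample_corpus = ["नमस्ते दुनिया", "यह एक परीक्षण है"]

word_level_ratio = token_to_word_ratio(sample_corpus, whitespace_tokenizer)
char_level_ratio = token_to_word_ratio(sample_corpus, char_tokenizer)
```

Comparing two tokenizers on the same corpus with this function gives a like-for-like number; the one with the smaller ratio spends fewer tokens on the same text.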

Abstract

We present a novel approach to data preparation for developing a multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data and optimizing tokenization for our multilingual dataset, targeting Indic large language models with 3B and 7B parameters engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer…
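
The abstract does not say which deduplication method is used, so the following is only a minimal sketch of one common baseline: exact-duplicate removal by hashing normalized page text. The normalization steps (Unicode NFKC, whitespace collapsing, case folding), the SHA-256 choice, and the names `normalize_page` and `deduplicate_pages` are assumptions for illustration; near-duplicate techniques such as MinHash are a separate matter and are not shown.

```python
import hashlib
import unicodedata
from typing import Iterable, List, Tuple


def normalize_page(text: str) -> str:
    """Canonicalize text so trivially different copies hash identically.

    Assumed normalization: NFKC, collapsed whitespace, case folding.
    """
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split()).casefold()


def deduplicate_pages(pages: Iterable[str]) -> Tuple[List[str], float]:
    """Keep the first copy of each page and report the duplicate fraction."""
    seen_digests = set()
    unique_pages: List[str] = []
    total = 0
    for page in pages:
        total += 1
        digest = hashlib.sha256(normalize_page(page).encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # exact duplicate after normalization
        seen_digests.add(digest)
        unique_pages.append(page)
    duplicate_fraction = 0.0 if total == 0 else (total - len(unique_pages)) / total
    return unique_pages, duplicate_fraction


# Toy input: two pairs of pages differ only in spacing or letter case.
crawled_pages = [
    "Hello  world",
    "hello world",
    "नमस्ते दुनिया",
    "नमस्ते   दुनिया ",
    "A different page",
]
unique_pages, duplicate_fraction = deduplicate_pages(crawled_pages)
```

Hashing keeps memory proportional to the number of distinct pages rather than their total size, which matters at crawl scale; a production pipeline would typically shard the digest set across workers.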

Taxonomy

Topics: Natural Language Processing Techniques