Krutrim LLM: A Novel Tokenization Strategy for Multilingual Indic Languages with Petabyte-Scale Data Processing
Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar, Gautam Bhargava, Chandra Khatri

TL;DR
This paper introduces Krutrim LLM, a multilingual Indic language model built on careful data collection, deduplication, and a custom-trained tokenizer that achieves a better token-to-word ratio than OpenAI's Tiktoken.
Contribution
The paper presents a new tokenization method and data processing pipeline specifically designed for multilingual Indic languages, enhancing model performance.
Findings
Custom Indic tokenizer outperforms OpenAI Tiktoken in token-to-word ratio.
Deduplication addresses the redundancy present in 70% of crawled Common Crawl web pages.
High-quality, diverse data improves multilingual Indic language modeling.
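The token-to-word ratio named in the findings can be illustrated with a minimal sketch. It assumes words are whitespace-separated and that a tokenizer is any function from a string to a list of tokens; the two toy tokenizers below are hypothetical stand-ins, not the paper's tokenizer or Tiktoken. A lower ratio means fewer tokens are spent per word.

```python
from typing import Callable, Iterable, List


def token_to_word_ratio(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())      # assumption: words split on whitespace
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


# Hypothetical toy tokenizers, used only to show how the metric compares two schemes.
def char_tokenizer(text: str) -> List[str]:
    """One token per non-whitespace character (worst case for the ratio)."""
    return [ch for ch in text if not ch.isspace()]


def whitespace_tokenizer(text: str) -> List[str]:
    """One token per word (ratio of exactly 1.0)."""
    return text.split()


if __name__ == "__main__":
    sample = ["ab cd", "efgh"]
    print(token_to_word_ratio(sample, char_tokenizer))
    print(token_to_word_ratio(sample, whitespace_tokenizer))
```

To compare real tokenizers, the same function can be called with each tokenizer's encode step wrapped as `tokenize`, using the same evaluation text for both.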
Abstract
We present a novel approach to data preparation for developing a multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data and optimizing tokenization for our multilingual dataset, targeting Indic large language models with 3B and 7B parameters engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer…
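The abstract excerpt does not say how the Common Crawl deduplication is performed. As an illustration only, the sketch below shows one common baseline: exact-match deduplication by hashing normalized text. The `normalize` and `deduplicate` helpers are hypothetical names, and this should not be read as the authors' pipeline, which may use near-duplicate detection or other techniques.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)  # keep the original, un-normalized text
    return unique


if __name__ == "__main__":
    pages = ["Hello  world", "hello world", "A different page"]
    print(deduplicate(pages))
```

Storing digests rather than full documents keeps memory use proportional to the number of distinct pages, which matters at web-crawl scale. Exact hashing misses near-duplicates such as pages that differ only in a timestamp.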
Taxonomy
Topics: Natural Language Processing Techniques
