Krutrim LLM: Multilingual Foundational Model for over a Billion People

Aditya Kallappa; Palash Kamble; Abhinav Ravi; Akshat Patidar; Vinayak; Dhruv; Deepak Kumar; Raghav Awasthi; Arveti Manjunath; Himanshu Gupta; Shubham Agarwal; Kumar Ashish; Gautam Bhargava; Chandra Khatri

arXiv:2502.09642 · cs.CL · February 25, 2025 · 2 cites

TL;DR

Krutrim LLM is a large multilingual model tailored for India's diverse languages, achieving strong performance across Indic languages and English, and integrated with real-time search to enhance factual accuracy for over a billion users.

Contribution

Introduces Krutrim LLM, a multilingual model trained on 2 trillion tokens that incorporates the largest known Indic dataset, addressing the data scarcity and linguistic diversity challenges of building AI systems for India.

Findings

01 Outperforms or matches state-of-the-art models on Indic benchmarks.

02 Achieves comparable performance to LLAMA-2 on 10 out of 16 tasks.

03 Balances multilingual fluency with efficient training size.

Abstract

India is a diverse society with unique challenges in developing AI systems, including linguistic diversity, oral traditions, data accessibility, and scalability. Existing foundation models are primarily trained on English, limiting their effectiveness for India's population. Indic languages comprise only 1 percent of Common Crawl corpora despite India representing 18 percent of the global population, leading to linguistic biases. Thousands of regional languages, dialects, and code mixing create additional representation challenges due to sparse training data. We introduce Krutrim LLM, a 2 trillion token multilingual model designed for India's linguistic landscape. It incorporates the largest known Indic dataset, mitigating data scarcity and ensuring balanced performance across dialects. Krutrim outperforms or matches state-of-the-art models on Indic benchmarks while maintaining…

Code & Models

Repositories

ola-krutrim/Krutrim-1-7B (PyTorch, official)

Taxonomy

Topics: Multi-Agent Systems and Negotiation