HindiLLM: Large Language Model for Hindi
Sanjay Chouhan, Shubha Brata Nath, Aparajita Dutta

TL;DR
This paper introduces HindiLLM, a large language model for Hindi, trained through unsupervised pre-training and supervised fine-tuning, achieving superior performance on various language tasks compared to existing models.
Contribution
The paper presents the first large-scale Hindi language models, HindiLLM-Small and HindiLLM-Medium, with a new Hindi tokenizer and fine-tuning on multiple NLP tasks.
Findings
HindiLLM models outperform existing models on language tasks
Fine-tuned models achieve high accuracy in sentiment analysis and classification
A new high-quality Hindi text corpus and tokenizer are introduced
Abstract
Advances in Large Language Models (LLMs) have helped solve several problems in language processing. Most research has focused on English alone, because of its popularity and abundance on the internet. However, a high-performance language model for Hindi and other Indic languages is lacking in the literature. In this work, we pre-train two autoregressive LLMs for the Hindi language, namely HindiLLM-Small and HindiLLM-Medium. We use a two-step process comprising unsupervised pre-training and supervised fine-tuning. First, we create a large, high-quality text corpus for unsupervised pre-training. Next, we train a Byte-Pair Encoding (BPE) tokenizer, named the HindiLLM tokenizer, on the pre-training text data. We then train on the unlabeled data, known as the pre-training step, to obtain the HindiLLM base models. Furthermore, we perform…
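The tokenizer step mentioned in the abstract can be illustrated with a minimal sketch of Byte-Pair Encoding training. This is not the authors' implementation: the function names (`train_bpe`, `merge_pair`, `encode`) are hypothetical, the sketch starts from Unicode code points rather than raw bytes, and it splits words on whitespace only, all of which are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of text lines.

    Assumption: words are whitespace-separated and the initial symbols are
    single Unicode code points (a real tokenizer may start from bytes).
    """
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the most frequent pair fused.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Calling `train_bpe(lines, num_merges)` on a list of Hindi text lines returns the ordered merge rules, and `encode(word, merges)` splits a word into subword tokens whose concatenation reproduces the original word. The number of merges controls the vocabulary size; no value is suggested here because this excerpt of the abstract does not state the vocabulary size used.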
Taxonomy
Methods: Balanced Selection
