HindiLLM: Large Language Model for Hindi

Sanjay Chouhan; Shubha Brata Nath; Aparajita Dutta

arXiv:2412.20357·cs.CL·December 31, 2024

TL;DR

This paper introduces HindiLLM, a large language model for Hindi, trained through unsupervised pre-training and supervised fine-tuning, achieving superior performance on various language tasks compared to existing models.

Contribution

The paper presents the first large-scale Hindi language models, HindiLLM-Small and HindiLLM-Medium, with a new Hindi tokenizer and fine-tuning on multiple NLP tasks.

Findings

01 HindiLLM models outperform existing models on language tasks.

02 Fine-tuned models achieve high accuracy in sentiment analysis and classification.

03 A new high-quality Hindi text corpus and tokenizer are introduced.

Abstract

Advances in Large Language Models (LLMs) have helped solve several language-processing problems. Most research has focused on English alone because of its popularity and abundance on the internet, while a high-performance language model for Hindi and other Indic languages is still lacking in the literature. In this work, we pre-train two autoregressive LLMs for Hindi, namely HindiLLM-Small and HindiLLM-Medium. We use a two-step process comprising unsupervised pre-training and supervised fine-tuning. First, we create a large, high-quality text corpus for unsupervised pre-training. Next, we train a Byte-Pair Encoding (BPE) tokenizer, named the HindiLLM tokenizer, on the pre-training text data. We then train on the unlabeled data, the pre-training step, to obtain the HindiLLM base models. Furthermore, we perform…
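The tokenizer step mentioned in the abstract can be illustrated with a minimal sketch of how Byte-Pair Encoding learns merge rules from raw text. This is not the authors' implementation: the function names (`learn_bpe_merges`, `merge_pair`, `encode_word`), the toy corpus, and the choice to start from Unicode code points rather than raw bytes are all assumptions made for illustration, and a real tokenizer would be trained on the full pre-training corpus with a much larger number of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of text lines.

    Assumption: words are split on whitespace and start as sequences of
    Unicode code points (a byte-level variant would start from UTF-8 bytes).
    """
    word_freqs = Counter()
    for line in corpus:
        for word in line.split():
            word_freqs[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word before the next round.
        merged_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged_freqs[merge_pair(symbols, best)] += freq
        word_freqs = merged_freqs
    return merges


def encode_word(word, merges):
    """Tokenize one word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus, used only to exercise the functions.
    toy_corpus = ["नमस्ते भारत", "नमस्ते दुनिया", "भारत महान"]
    rules = learn_bpe_merges(toy_corpus, num_merges=10)
    print(encode_word("नमस्ते", rules))
```

Because frequent character sequences in the training text are merged into single tokens, a tokenizer learned on Hindi text would typically represent Hindi words with fewer tokens than one learned mostly on English, which is the usual motivation for training a language-specific tokenizer.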

Taxonomy

Methods: Balanced Selection