Development of Pre-Trained Transformer-based Models for the Nepali Language

Prajwal Thapa; Jinu Nyachhyon; Mridul Sharma; Bal Krishna Bal

arXiv:2411.15734·cs.CL·August 20, 2025

Open Access · 4 Models · 1 Dataset

TL;DR

This paper introduces Nepali-specific transformer models (BERT, RoBERTa, GPT-2) pre-trained on a 27.5 GB Nepali corpus, improving results on Nepali language understanding and text generation tasks.

Contribution

It is the first to pre-train multiple transformer models on a large Nepali corpus and explore instruction tuning, filling a resource gap for Nepali NLP research.

Findings

01

Models outperform previous best on Nep-gLUE benchmark by 2 points.

02

Achieved state-of-the-art results in Nepali text generation.

03

Collected 27.5 GB of Nepali text, approximately 2.4x larger than any previously available Nepali corpus.

Abstract

Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models, i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali language. Furthermore, we performed instruction…

Code & Models

Models

Datasets

IRIIS-RESEARCH/Nepali-Text-Corpus
dataset · 699 downloads

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Discriminative Fine-Tuning · Cosine Annealing · Linear Layer · Byte Pair Encoding · Adam · RoBERTa · Residual Connection · Weight Decay · Softmax