Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer

Adarsha Shrestha; Basanta Pokharel; Binit Shrestha; Smriti Adhikari; Dinesh Gothe

arXiv:2512.14585·cs.CL·December 17, 2025

Open Access

TL;DR

This paper introduces a Nepali-language GPT-2 model trained with a custom BPE tokenizer and optimized training strategies, demonstrating effective Nepali text generation despite limited resources.

Contribution

The paper presents a custom 16k BPE tokenizer trained exclusively on Nepali text, together with a GPT-2 training methodology tailored to low-resource Nepali NLP.

Findings

01. Achieved a perplexity of 21.80 on Nepali text

02. Demonstrated coherent Nepali news-style text generation

03. Implemented memory-efficient training with FlashAttention
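For readers unfamiliar with the metric in the first finding, the sketch below shows how perplexity is conventionally computed: the exponential of the mean negative log-likelihood per token. It assumes natural-log probabilities and per-token normalization; the page does not say how the paper normalizes, and the function name `perplexity` and the sample values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a token sequence.

    `token_log_probs` holds the natural-log probability the model
    assigned to each observed token (assumption: per-token normalization).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Hypothetical illustration: a model that gives every observed token
    # probability 1/21.80 has perplexity 21.80 by construction.
    uniform = [math.log(1 / 21.80)] * 5
    print(perplexity(uniform))
```

Read this way, a perplexity of 21.80 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 22 tokens at each step. Perplexity values depend on the tokenizer, so they are not directly comparable across models with different vocabularies.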

Abstract

Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was…
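To make the tokenizer idea in the abstract concrete, here is a minimal sketch of how BPE merges are learned and applied. It is a toy illustration, not the authors' tokenizer: it operates on Unicode code points rather than bytes, skips pre-tokenization and special tokens, and the names `learn_bpe`, `merge_pair`, and `encode`, as well as the tiny corpus, are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    # Start from single code points (assumption: real byte-level BPE starts from bytes).
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Break frequency ties by the pair itself so results are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus; a real tokenizer is trained on gigabytes of text.
    corpus = ["नेपाल", "नेपाली", "नेपालमा", "काठमाडौं"]
    merges = learn_bpe(corpus, num_merges=10)
    print(encode("नेपाली", merges))
```

Because the merges come only from Nepali text, frequent Devanagari sequences are fused into single tokens early, which is the intuition behind the abstract's claim of more consistent segmentation. A 16k tokenizer would continue merging until the vocabulary reaches roughly 16,000 entries.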

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification