Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Nishant Luitel; Nirajan Bekoju; Anand Kumar Sah; Subarna Shakya

arXiv:2404.18071·cs.CL·August 12, 2025

TL;DR

This study investigates how the choice of tokenization strategy affects the understanding capabilities of Nepali language models. It finds that SentencePiece tokenization outperforms byte-level BPE on downstream understanding tasks for Nepali, a non-Latin-script language.

Contribution

The paper provides a comprehensive evaluation of six tokenization schemes on Nepali transformer models, emphasizing the importance of tokenization choices beyond perplexity for low-resource languages.

Findings

01. SentencePiece tokenization improves understanding tasks in Nepali models

02. Byte-level BPE is less effective for Nepali language understanding

03. Insights for developing better language models in low-resource, non-Latin script languages

Abstract

The impact of subword tokenization on language model performance is well-documented for perplexity, with finer granularity consistently reducing this intrinsic metric. However, research on how different tokenization schemes affect a model's understanding capabilities remains limited, particularly for non-Latin script languages. Addressing this gap, we conducted a comprehensive evaluation of six distinct tokenization strategies by pretraining transformer-based language models for Nepali and evaluating their performance across multiple downstream tasks. While recent prominent models like GPT, RoBERTa, Claude, LLaMA, Mistral, Falcon, and MPT have adopted byte-level BPE tokenization, our findings demonstrate that for Nepali, SentencePiece tokenization consistently yields superior results on understanding-based tasks. Unlike previous studies that primarily focused on BERT-based…
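As a rough illustration of why perplexity can be hard to compare across tokenizers, the sketch below computes per-token perplexity for one sentence segmented at two granularities. The numbers are hypothetical and are not taken from the paper: both segmentations are assumed to assign the same total log-likelihood to the sentence, so the only thing that changes is the number of tokens used as the normalizer. This shows one arithmetic reason a finer tokenization can report lower perplexity; it is not the authors' evaluation procedure.

```python
import math


def perplexity(token_log_probs):
    """Per-token perplexity: exp of the mean negative log-probability (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical values (assumption, not from the paper): the same sentence,
# with the same total log-likelihood in nats, split into 6 or 12 tokens.
total_log_likelihood = -24.0
coarse = [total_log_likelihood / 6] * 6    # coarser segmentation, 6 tokens
fine = [total_log_likelihood / 12] * 12    # finer segmentation, 12 tokens

ppl_coarse = perplexity(coarse)  # equals exp(4)
ppl_fine = perplexity(fine)      # equals exp(2)

print(f"coarse: {ppl_coarse:.2f}  fine: {ppl_fine:.2f}")
```

Under these assumptions the finer segmentation reports the lower per-token perplexity even though the sentence is scored identically overall, which is consistent with the abstract's point that this intrinsic metric alone may not reflect downstream performance.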

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Dense Connections · Linear Warmup With Linear Decay · Adam · Layer Normalization · Attention Dropout