Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus
Mullosharaf K. Arabov

TL;DR
This paper introduces the Tajik Web Corpus, the largest open-access corpus of Tajik, and systematically evaluates parameter-efficient fine-tuning (PEFT) methods for large language models, yielding practical guidelines for low-resource Tajik text generation.
Contribution
The paper creates the first large verified Tajik corpus and analyzes the effectiveness of PEFT methods, offering guidance on balancing computational cost against model performance.
Findings
Mistral 7B with QLoRA (r=16) achieved the best results: mean perplexity 5.03 (standard deviation 0.03).
Full fine-tuning of small GPT-2 models yields lower perplexity but causes catastrophic forgetting.
Encoder-only XLM-RoBERTa performs poorly, with a perplexity of 59.3.
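To make the rank setting in the findings concrete, the sketch below counts the trainable parameters a LoRA adapter adds to a single weight matrix. LoRA replaces the update to a d_out x d_in matrix with the product of a d_out x r matrix and an r x d_in matrix, so the adapter holds r * (d_in + d_out) parameters. The function name and the 4096 x 4096 layer shape are illustrative assumptions, not values taken from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


# Assumed layer shape for illustration only (a square 4096 x 4096 projection).
D_IN = 4096
D_OUT = 4096

full_params = D_IN * D_OUT
for rank in (8, 16):
    adapter = lora_trainable_params(D_IN, D_OUT, rank)
    share = 100.0 * adapter / full_params
    print(f"r={rank}: {adapter} adapter parameters ({share:.2f}% of the full matrix)")
```

Doubling the rank doubles the adapter size per adapted matrix, so rank directly affects the memory taken by adapter weights, gradients, and optimizer state.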
Abstract
This paper addresses the adaptation of generative large language models to Tajik, a low-resource language written in Cyrillic script. To overcome the shortage of digital text resources, the author created and publicly released the Tajik Web Corpus, the largest open-access corpus of Tajik, comprising 319,298 documents (~1.11 billion characters). On a subsample of 10,000 documents, 17 configurations were benchmarked, covering autoregressive, encoder-decoder, and encoder-only models with three fine-tuning strategies: full fine-tuning, LoRA, and QLoRA (ranks 8 and 16). Quality was assessed via perplexity and cross-entropy loss; peak GPU memory and training time were also recorded. The best results were achieved by Mistral 7B with QLoRA (r=16): mean perplexity 5.03, standard deviation 0.03. Increasing the rank from 8 to 16 gave a statistically insignificant improvement while raising memory…
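The two quality metrics named in the abstract are directly related: perplexity is the exponential of the mean per-token cross-entropy. The sketch below shows that relationship and how a mean and standard deviation over several runs could be computed. It assumes cross-entropy is measured in nats (natural logarithm); all numeric inputs and helper names are hypothetical and are not data from the paper.

```python
import math
from statistics import mean, stdev


def token_level_cross_entropy(token_log_probs):
    """Mean negative log-likelihood per token (in nats)."""
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is exp(cross-entropy) when the loss uses natural logs."""
    return math.exp(cross_entropy_nats)


# Hypothetical log-probabilities a model assigns to five reference tokens.
log_probs = [-1.2, -0.7, -2.3, -1.9, -1.6]
ce = token_level_cross_entropy(log_probs)
print(f"cross-entropy: {ce:.3f} nats, perplexity: {perplexity(ce):.3f}")

# Hypothetical evaluation losses from three runs with different seeds.
run_losses = [1.610, 1.615, 1.621]
run_ppls = [perplexity(loss) for loss in run_losses]
print(f"mean perplexity: {mean(run_ppls):.2f}, std: {stdev(run_ppls):.2f}")
```

Under the same natural-log assumption, a perplexity of 5.03 corresponds to a cross-entropy of roughly 1.62 nats per token.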
