Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

Nurkhan Laiyk; Daniil Orel; Rituraj Joshi; Maiya Goloburda; Yuxia Wang; Preslav Nakov; Fajri Koto

arXiv:2502.13647 · cs.CL · March 17, 2026

TL;DR

This paper presents a 10,600-sample instruction-following dataset for Kazakh and shows that fine-tuning LLMs on it improves performance on both multiple-choice and generative tasks.

Contribution

Introduces and open-sources a 10,600-sample instruction-following dataset for Kazakh, built with LLM-assisted data generation (GPT-4o as the backbone) and fully verified by hand.

Findings

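1. Fine-tuning Qwen, Falcon, and Gemma on the dataset consistently improves performance on multiple-choice and generative tasks.

2. LLM-assisted data generation, followed by manual verification, is effective for building instruction data in a low-resource language.

3. The dataset improves LLMs' understanding of procedural, legal, and structural governance topics.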

Abstract

Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for…

Code & Models

Datasets

nurkhan5l/kazakh-ift
dataset · 10 downloads

Taxonomy

Topics: Multilingual Education and Policy