SozKZ: Training Efficient Small Language Models for Kazakh from Scratch

Saken Tukenov

arXiv:2603.20854·cs.CL·March 24, 2026

Open Access · 1 model

TL;DR

This paper introduces SozKZ, a family of small Kazakh language models (50M-600M parameters) trained from scratch with a dedicated 50K BPE tokenizer. On three Kazakh benchmarks the 600M model is competitive with larger multilingual baselines, which suggests that dedicated models are a practical option for low-resource languages.

Contribution

We develop small Kazakh language models trained from scratch with a dedicated tokenizer, and evaluate them against larger multilingual models on three Kazakh benchmarks.

Findings

01

The 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B

02

On SIB-200 topic classification, the 600M model (25.5%) surpasses all evaluated multilingual baselines up to 2B parameters

03

Scaling from 50M to 600M improves performance consistently

Abstract

Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks -- multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200) -- alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe…
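The abstract's point about tokenizers and agglutinative morphology can be illustrated with a toy byte-pair-encoding (BPE) learner. This is a minimal sketch and not the SozKZ tokenizer, which is a 50K-vocabulary BPE trained on the full corpus. The four-word corpus, the function names `learn_bpe`, `merge_pair`, and `encode`, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

# Toy corpus (an assumption for illustration): one Kazakh stem with
# progressively stacked suffixes, as is typical of agglutinative morphology.
CORPUS = ["кітап", "кітаптар", "кітаптарым", "кітаптарымда"]


def merge_pair(symbols, pair):
    """Replace every left-to-right occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic (a choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment `word` by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Calling `encode("кітаптарда", learn_bpe(CORPUS, 3))` splits an unseen inflected form into fewer pieces than it has characters, because frequent fragments of the stem and suffixes have already been merged. A tokenizer whose merges were learned mostly from other languages has no such fragments to reuse, which is the inefficiency a dedicated tokenizer is meant to avoid.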

Code & Models

Models

Hugging Face: stukenov/sozkz-core-llama-600m-kk-base-v1 (model · 62 downloads)
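A minimal sketch of loading the listed checkpoint with the Hugging Face `transformers` library follows. It assumes the repository ships a standard Llama-style config and tokenizer files that `AutoTokenizer` and `AutoModelForCausalLM` can read, which the listing does not confirm. The helper name `generate` and its defaults are hypothetical.

```python
# Repository ID as given in the listing.
MODEL_ID = "stukenov/sozkz-core-llama-600m-kk-base-v1"


def generate(prompt, max_new_tokens=40):
    """Greedy-decode a continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so this file can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repo is loadable through the Auto* classes.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for reproducible output
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `generate("Қазақстанның астанасы")` would download the weights on first use and return the prompt with a continuation. Since this is a base model rather than an instruction-tuned one, it should be expected to continue text, not follow instructions.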

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Language and cultural evolution