SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
Saken Tukenov

TL;DR
This paper introduces SozKZ, a family of small Kazakh language models trained from scratch with a dedicated tokenizer. The models perform competitively on Kazakh benchmarks, which suggests that purpose-built models for a low-resource language can rival larger multilingual ones.
Contribution
We develop and evaluate small Kazakh language models trained from scratch with a dedicated tokenizer, and show that they are competitive with larger multilingual models.
Findings
600M model achieves 30.3% accuracy on cultural QA
On SIB-200 topic classification, the 600M model (25.5%) surpasses all evaluated multilingual baselines up to 2B parameters
Scaling from 50M to 600M improves performance consistently
Abstract
Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks -- multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200) -- alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe…
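The abstract's point about tokenizers and agglutinative morphology can be illustrated with a toy byte-pair-encoding (BPE) learner. The sketch below is not the SozKZ tokenizer: the four-word corpus, the merge budget of 8, and the helper names (`apply_merge`, `learn_bpe`, `encode`, `fertility`) are assumptions made for illustration. It shows how merges learned on Kazakh text reduce the average number of tokens per word (often called fertility) relative to character-level splitting.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = Counter({tuple(apply_merge(s, best)): f for s, f in vocab.items()})
    return merges

def encode(word, merges):
    """Tokenize one word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

def fertility(text, merges):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return sum(len(encode(w, merges)) for w in words) / len(words)

# Toy corpus (assumption): inflected forms of the Kazakh word for "school".
corpus = ["мектеп", "мектепте", "мектептер", "мектептерде"]
text = " ".join(corpus)
merges = learn_bpe(corpus, num_merges=8)

print("character-level fertility:", fertility(text, []))
print("fertility after BPE merges:", fertility(text, merges))
```

A tokenizer whose merges were learned mostly on other languages would contain few merges that match Kazakh stems and suffixes, so its fertility on Kazakh text would stay closer to the character-level figure. A production setup would use a tokenizer library and a vocabulary of the size reported in the abstract (50K) rather than this toy loop.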
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Language and cultural evolution
