KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter
Rauan Akylzhanov

TL;DR
This paper introduces ByteKaz, a byte-level adapter that feeds raw bytes to a frozen Qwen2.5-7B in place of tokenizer output, with the aim of matching or exceeding the original model's accuracy on Kazakh benchmarks.
Contribution
Proposes a byte-level adapter method for adapting Qwen models to Kazakh, intended to remove the tokenizer's overhead on Kazakh text and improve language-specific performance.
Findings
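Bypassing the tokenizer is expected to reduce the compute and context-window costs of over-fragmented Kazakh text.
Two-stage training (first train the adapter, then fine-tune Qwen's attention layers) is hypothesized to improve adaptation to Kazakh.
The report presents the design and hypotheses; empirical validation is ongoing.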
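The two-stage schedule listed above can be illustrated as a rule for which parameters receive gradients in each stage. This is a minimal sketch, not the paper's code: the `byte_adapter.` prefix, the example parameter names, and the `trainable_params` helper are all hypothetical, and the layer names merely imitate common decoder naming.

```python
# Hypothetical parameter names. "byte_adapter." is an invented prefix for the
# adapter; the "self_attn" / "mlp" names are assumed, not taken from the paper.
PARAM_NAMES = [
    "byte_adapter.in_proj.weight",
    "byte_adapter.out_proj.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.input_layernorm.weight",
]


def trainable_params(stage, names):
    """Return the parameter names that would be updated in the given stage."""
    if stage == 1:
        # Stage 1: the base model is frozen; only the adapter learns.
        return [n for n in names if n.startswith("byte_adapter.")]
    if stage == 2:
        # Stage 2: the adapter is frozen; only attention layers are fine-tuned.
        return [n for n in names if ".self_attn." in n]
    raise ValueError(f"unknown stage: {stage}")


for stage in (1, 2):
    print(stage, trainable_params(stage, PARAM_NAMES))
```

In a real training loop, the same rule would typically be applied by toggling each parameter's gradient flag before building the optimizer for that stage.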
Abstract
Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model's grip on Kazakh morphology. We propose to bypass the tokenizer entirely by feeding raw bytes through a small adapter that learns to speak the internal language of a frozen Qwen2.5-7B. Once the adapter is trained, we freeze it and fine-tune only the attention layers of Qwen on Kazakh text. Our central hypothesis is that this two-stage process -- first teach the interface, then adapt the model -- should match or exceed the accuracy of the original Qwen2.5-7B on standard Kazakh benchmarks. This report describes the ByteKaz architecture and training protocol. Empirical validation is ongoing; this version stakes the design and hypotheses…
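To make "feeding raw bytes" concrete, the sketch below converts text into the UTF-8 byte IDs that a byte-level adapter would plausibly consume. The `to_byte_ids` helper and the sample words are illustrative assumptions; the abstract does not say how the adapter groups or compresses these bytes.

```python
def to_byte_ids(text):
    """Encode text as a list of UTF-8 byte values (integers in 0..255)."""
    return list(text.encode("utf-8"))


kazakh = "Қазақстан"    # 9 Cyrillic characters, 2 bytes each in UTF-8
english = "Kazakhstan"  # 10 ASCII characters, 1 byte each

for word in (kazakh, english):
    ids = to_byte_ids(word)
    # The mapping is lossless: decoding the bytes restores the original text.
    assert bytes(ids).decode("utf-8") == word
    print(word, len(word), "chars ->", len(ids), "bytes")
```

The input vocabulary is fixed at 256 values regardless of language, which is what removes the dependence on a tokenizer built for high-resource languages. The trade-off is that Cyrillic text takes two bytes per character, so byte sequences are longer than character sequences.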
