KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP
Adilet Metinov, Gulida M. Kudakeeva, Gulnara D. Kabaeva

TL;DR
KyrgyzBERT is the first publicly available monolingual BERT-based language model for Kyrgyz: a compact 35.9M-parameter model with a custom tokenizer, released together with a new sentiment analysis benchmark, kyrgyz-sst2.
Contribution
The paper introduces KyrgyzBERT, the first publicly available Kyrgyz-specific BERT model, along with a new sentiment analysis dataset, advancing NLP resources for the low-resource Kyrgyz language.
Findings
KyrgyzBERT, fine-tuned on kyrgyz-sst2, achieves an F1-score of 0.8280 on Kyrgyz sentiment analysis.
KyrgyzBERT is competitive with a fine-tuned mBERT model five times larger.
All resources are publicly released to foster further research.
Abstract
Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.
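For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-score is computed from predicted and gold sentiment labels. The abstract does not state which averaging the authors used, so this sketch assumes binary F1 on the positive class. The function name `binary_f1` and the toy labels are illustrative assumptions, not taken from the paper or from kyrgyz-sst2.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = binary_f1(y_true, y_pred)
```

If the authors reported macro- or weighted-average F1 instead, the per-class scores would be computed the same way and then averaged across the two classes.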
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection · Topic Modeling
