KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP

Adilet Metinov; Gulida M. Kudakeeva; Gulnara D. Kabaeva

arXiv:2511.20182 · cs.CL · November 26, 2025

TL;DR

KyrgyzBERT is the first publicly available monolingual BERT-based language model for Kyrgyz: a compact 35.9M-parameter model with a custom tokenizer, released together with kyrgyz-sst2, a new sentiment analysis benchmark.

Contribution

The paper introduces KyrgyzBERT, the first publicly available Kyrgyz-specific BERT model, along with a new sentiment analysis dataset, advancing NLP resources for the low-resource Kyrgyz language.

Findings

01 Fine-tuned on kyrgyz-sst2, KyrgyzBERT achieves an F1-score of 0.8280 on Kyrgyz sentiment analysis.

02 This score is competitive with a fine-tuned mBERT model five times larger.

03 All models, data, and code are publicly released to support further research.

Abstract

Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.
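
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-score is computed for binary sentiment labels. It assumes the binary variant with the positive class encoded as 1; the abstract does not say which averaging the authors used, so this may differ from their evaluation script. The function name `f1_score` and the toy labels are illustrative and are not taken from the paper or from kyrgyz-sst2.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration only (1 = positive, 0 = negative).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = f1_score(gold, pred)
print(f"F1 = {score:.4f}")
```

On the toy labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F1-score is 0.75. A reported value such as 0.8280 would be obtained the same way over the full test set, under the stated assumption about the metric variant.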

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection · Topic Modeling