Benchmarking Multilabel Topic Classification in the Kyrgyz Language
Anton Alekseev, Sergey I. Nikolenko, Gulnara Kabaeva

TL;DR
This paper introduces a new benchmark dataset for multilabel topic classification in Kyrgyz, providing baseline models and analysis to advance NLP resources for this underrepresented language.
Contribution
The paper presents the first public multilabel topic classification dataset for Kyrgyz and evaluates several baseline models on it.
Findings
Neural models outperform classical statistical models.
Baseline scores establish a reference for future research.
Discussion highlights challenges and future directions.
Abstract
Kyrgyz is a very underrepresented language in terms of modern natural language processing resources. In this work, we present a new public benchmark for topic classification in Kyrgyz, introducing a dataset based on collected and annotated data from the news site 24.KG and presenting several baseline models for news classification in the multilabel setting. We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
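The abstract does not say which scores are reported. As a minimal sketch of how multilabel predictions are commonly evaluated, the snippet below computes micro- and macro-averaged F1 from per-document label sets. The topic names and the toy predictions are hypothetical and are not taken from the 24.KG dataset.

```python
from typing import Dict, List, Set, Tuple


def per_label_counts(
    gold: List[Set[str]], pred: List[Set[str]]
) -> Dict[str, Tuple[int, int, int]]:
    """Return {label: (tp, fp, fn)} accumulated over all documents."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = set().union(*gold, *pred)
    counts = {}
    for label in sorted(labels):
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        counts[label] = (tp, fp, fn)
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]]
) -> Tuple[float, float]:
    """Micro-F1 pools counts over labels; macro-F1 averages per-label F1."""
    counts = per_label_counts(gold, pred)
    if not counts:
        return 0.0, 0.0
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical topics for three news articles (illustration only).
gold = [{"politics", "economy"}, {"sports"}, {"economy"}]
pred = [{"politics"}, {"sports", "economy"}, {"economy"}]

micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))  # 0.75 0.833
```

Micro-averaging weights every label assignment equally, so frequent topics dominate the score, while macro-averaging gives each topic the same weight regardless of how often it occurs. Reporting both is a common way to expose weak performance on rare topics.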
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Topic Modeling
