Benchmarking Multilabel Topic Classification in the Kyrgyz Language

Anton Alekseev; Sergey I. Nikolenko; Gulnara Kabaeva

arXiv:2308.15952 · cs.CL · August 31, 2023 · 1 citation

TL;DR

This paper introduces a new benchmark dataset for multilabel topic classification in Kyrgyz, providing baseline models and analysis to advance NLP resources for this underrepresented language.

Contribution

It presents the first public multilabel topic classification dataset for Kyrgyz and evaluates multiple baseline models on this resource.

Findings

01 Neural models outperform classical statistical models.

02 Baseline scores establish a reference for future research.

03 Discussion highlights challenges and future directions.

Abstract

Kyrgyz is a very underrepresented language in terms of modern natural language processing resources. In this work, we present a new public benchmark for topic classification in Kyrgyz, introducing a dataset based on collected and annotated data from the news site 24.KG and presenting several baseline models for news classification in the multilabel setting. We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
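
To make the multilabel setting concrete, here is a minimal sketch in plain Python of how per-article topic sets can be encoded as multi-hot vectors and scored. Everything in it is an assumption for illustration: the topic names, the example predictions, and the choice of micro-averaged F1 are hypothetical and are not taken from the paper or its repository.

```python
from typing import Iterable, List, Sequence, Set


def multi_hot(label_sets: Iterable[Set[str]], vocabulary: Sequence[str]) -> List[List[int]]:
    """Encode each article's set of topics as a 0/1 vector over the vocabulary."""
    index = {label: i for i, label in enumerate(vocabulary)}
    rows = []
    for labels in label_sets:
        row = [0] * len(vocabulary)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical topic vocabulary and toy labels (not from the 24.KG dataset).
TOPICS = ["politics", "economy", "sport"]
gold = multi_hot([{"politics", "economy"}, {"sport"}], TOPICS)
pred = multi_hot([{"politics"}, {"sport", "economy"}], TOPICS)
score = micro_f1(gold, pred)
```

Unlike single-label classification, an article here can carry several topics at once, so a prediction can be partially right; pooled counts such as those behind micro-F1 are one common way to credit that partial overlap.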

Code & Models

Repositories

alexeyev/kyrgyz-multi-label-topic-classification (PyTorch, official)

Taxonomy

Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Topic Modeling