100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Rustem Yeshpanov

arXiv:2605.08600 · cs.CL · May 13, 2026

TL;DR

This paper introduces a large, multilingual movie review dataset from Kazakhstan, annotated for sentiment and ratings, and benchmarks classical and transformer-based models on sentiment tasks.

Contribution

The paper provides a new publicly available corpus, manually annotated for language and sentiment polarity, and evaluates classical and transformer-based models on multilingual sentiment analysis tasks.

Findings

01 Transformer models outperform classical baselines in polarity classification.

02 Score classification is challenging due to class imbalance and subtle rating differences.

03 The dataset covers reviews from 2001 to 2025 in multiple languages, including code-switched texts.

Abstract

We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between…
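To make the classical TF-IDF baseline mentioned in the abstract more concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It is not the paper's pipeline: the whitespace tokenizer, the smoothed IDF formula, the helper names (`tokenize`, `tfidf_vectors`, `cosine`), and the three toy reviews are all assumptions made for illustration, and none of the text comes from the dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real Russian/Kazakh
    # reviews would likely need a proper tokenizer.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy reviews (hypothetical, not taken from the corpus):
# two Russian ones and one mixing a Russian and a Kazakh word.
docs = ["отличный фильм", "ужасный фильм", "фильм керемет"]
vectors = tfidf_vectors(docs)
```

In this toy example the word shared by every review ("фильм") receives a lower weight than the words that distinguish the reviews, which is the property a TF-IDF baseline relies on. A classifier such as logistic regression would then be trained on these vectors; that step is omitted here.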

Code & Models

Datasets

yeshpanovrustem/100k_movie_reviews_from_kz
dataset · 65 downloads
