100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts
Rustem Yeshpanov

TL;DR
This paper introduces a large, multilingual movie review dataset from Kazakhstan, annotated for sentiment and ratings, and benchmarks classical and transformer-based models on sentiment tasks.
Contribution
It releases a publicly available corpus of 100,502 reviews, manually annotated for language and sentiment polarity, and evaluates classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT) on two sentiment tasks.
Findings
Transformer models outperform classical baselines in polarity classification.
Five-class score classification remains challenging under leakage-controlled evaluation because of severe class imbalance and subtle rating distinctions.
The dataset covers reviews of 4,943 unique titles from 2001 to 2025, mainly in Russian, alongside Kazakh and code-switched texts.
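To illustrate why class imbalance makes score classification hard to evaluate, here is a minimal sketch in plain Python. It compares accuracy with macro-averaged F1 for a predictor that always outputs the most frequent score. The label counts in `toy_counts` are hypothetical and are not the corpus's actual rating distribution; the choice of macro-F1 as the metric is likewise an assumption, not something the summary states about the paper's evaluation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; every class counts equally."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets F1 = 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical, skewed five-class score distribution (NOT the real data).
toy_counts = {1: 5, 2: 5, 3: 10, 4: 20, 5: 60}
y_true = [score for score, n in toy_counts.items() for _ in range(n)]

# Trivial baseline: always predict the most frequent score.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, sorted(toy_counts))
print(f"accuracy={acc:.2f} macro-F1={mf1:.2f}")  # accuracy=0.60 macro-F1=0.15
```

On this toy distribution the trivial predictor reaches 60% accuracy but only 0.15 macro-F1, which is why a class-sensitive metric is a common choice when rating labels are heavily skewed.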
Abstract
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between…
