Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale

Karl Gustav Gailit; Kadri Muischnek; Kairit Sirts

arXiv:2512.09634·cs.CL·December 11, 2025

TL;DR

This paper introduces an Estonian subjectivity dataset with continuous ratings, analyzes annotation consistency, and explores automatic scoring using GPT-5, highlighting the potential and limitations of LLM-based methods.

Contribution

It creates the first Estonian subjectivity dataset with continuous scores and evaluates LLM-based automatic annotation, providing insights into annotation reliability and automation feasibility.

Findings

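01. Inter-annotator correlations were moderate, with some texts receiving scores at opposite ends of the scale.

02. Re-annotating the texts with the most divergent scores improved the inter-annotator correlation.

03. GPT-5 scores were similar to the human annotations overall, but differed in several respects.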

Abstract

This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment on automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. Because the inter-annotator correlations were moderate, with some texts receiving scores at opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, which improved the inter-annotator correlation. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment in annotation automation. These scores were similar to those of the human annotators; however, several differences…
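
The agreement analysis described in the abstract can be illustrated with a small sketch. It is not the authors' procedure: the abstract does not state which correlation coefficient or divergence criterion was used, so Pearson correlation and the max-minus-min score range are assumptions here. The function names and the toy scores are likewise invented for illustration and are not taken from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(scores):
    """Correlation for every pair of annotators.

    `scores` maps an annotator name to a list of 0-100 ratings,
    one per document, in the same document order for everyone.
    """
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


def most_divergent(scores, k):
    """Indices of the k documents with the widest score range.

    Assumption: divergence is measured as max minus min across annotators.
    Ties are broken by document index.
    """
    n_docs = len(next(iter(scores.values())))
    spreads = []
    for i in range(n_docs):
        ratings = [annotator_scores[i] for annotator_scores in scores.values()]
        spreads.append((max(ratings) - min(ratings), i))
    spreads.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in spreads[:k]]


# Toy ratings from four hypothetical annotators over five documents.
scores = {
    "A": [10, 80, 50, 95, 5],
    "B": [15, 70, 60, 90, 90],
    "C": [5, 85, 40, 100, 10],
    "D": [20, 75, 55, 85, 0],
}

for pair, r in pairwise_correlations(scores).items():
    print(pair, round(r, 2))

print(most_divergent(scores, 2))  # [4, 2]
```

In this toy data, document 4 is rated near 0 by three annotators and 90 by the fourth, which is the kind of opposite-ends disagreement the abstract mentions; under the range criterion assumed here, it would be the first candidate for re-annotation.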

Code & Models

Datasets

tartuNLP/Estonian_Subjectivity
dataset · 616 downloads

Taxonomy

Topics: Topic Modeling · Mental Health via Writing · Sentiment Analysis and Opinion Mining