Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

Marii Ojastu; Hele-Andra Kuulmets; Aleksei Dorkin; Marika Borovikova; Dage Särg; Kairit Sirts

arXiv:2511.17290·cs.CL·March 31, 2026

TL;DR

This study introduces an Estonian version of the WinoGrande dataset, compares model performance on human and machine translations, and assesses the impact of prompt engineering on translation quality.

Contribution

The paper provides a localized, culturally adapted Estonian dataset and analyzes how model performance differs between human and machine translations, highlighting the importance of expert involvement in translation.

Findings

01

Model performance on the human-translated Estonian data is slightly lower than on the original English test set.

02

Performance on machine-translated data is significantly worse than on the human-translated data.

03

Prompt engineering offers limited improvements in translation quality and model accuracy.

Abstract

In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open-source models on the human-translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human-translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is…
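To make the evaluation setup concrete, here is a minimal sketch of how accuracy on a WinoGrande-style item can be computed. It assumes the field layout of the original English WinoGrande (a `sentence` containing a single `_` blank, two candidates `option1` and `option2`, and an `answer` of "1" or "2"); whether the Estonian release uses exactly these field names is an assumption. The `toy_score` function and the two example items are hypothetical stand-ins for a real model's sentence score and real benchmark data, and this is not the authors' evaluation code.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]


def fill_blank(sentence: str, option: str) -> str:
    """Replace the single '_' placeholder with a candidate option."""
    return sentence.replace("_", option, 1)


def predict(item: Item, score: Callable[[str], float]) -> str:
    """Return "1" or "2" depending on which filled-in sentence scores higher."""
    s1 = score(fill_blank(item["sentence"], item["option1"]))
    s2 = score(fill_blank(item["sentence"], item["option2"]))
    return "1" if s1 >= s2 else "2"  # ties fall back to option 1


def accuracy(items: List[Item], score: Callable[[str], float]) -> float:
    """Fraction of items whose predicted option matches the gold answer."""
    if not items:
        return 0.0
    correct = sum(predict(item, score) == item["answer"] for item in items)
    return correct / len(items)


def toy_score(text: str) -> float:
    """Hypothetical scorer; a real setup would use a model's likelihood."""
    return 1.0 if "trophy is too large" in text else 0.0


# Illustrative items in the assumed format, not taken from the dataset.
ITEMS: List[Item] = [
    {
        "sentence": "The trophy does not fit into the suitcase because the _ is too large.",
        "option1": "trophy",
        "option2": "suitcase",
        "answer": "1",
    },
    {
        "sentence": "The trophy does not fit into the suitcase because the _ is too small.",
        "option1": "trophy",
        "option2": "suitcase",
        "answer": "2",
    },
]

print(accuracy(ITEMS, toy_score))
```

Running the same `accuracy` function over the English, human-translated, and machine-translated versions of the items with one fixed scorer is one way to obtain the kind of comparison the abstract describes.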

Code & Models

Datasets

tartuNLP/winogrande_et
dataset · 292 downloads