Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Kyo Gerrits, Rik van Noord, Ana Guerberof Arenas

TL;DR
This study evaluates automatic metrics and LLM-based judgments for literary translation, revealing significant limitations in assessing creativity and cultural nuance, especially in poetry and creative shifts.
Contribution
It provides a detailed dataset and analysis showing current evaluation tools poorly match professional judgments on creativity in literary translation.
Findings
AEMs and LLM evaluations correlate poorly with professional creativity assessments.
LLMs tend to favor machine translations and penalize creative, culturally nuanced solutions.
Performance drops notably for poetic and highly literary genres.
Abstract
This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation creativity (creative shifts and errors), and to see whether they can substitute for laborious manual annotation. A dataset of literary translations covering three modalities (human translation, machine translation, and post-editing), three genres, and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations of creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally…
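As a rough illustration of what "correlating" a metric with professional judgments can mean in practice, the sketch below computes a Spearman rank correlation between two lists of scores. It is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not say which correlation coefficient was used, and the function names (`average_ranks`, `pearson`, `spearman`) and all numbers are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five translated segments (illustration only, not
# data from the paper): one professional creativity score and one
# automatic metric score per segment.
professional_creativity = [4.0, 1.0, 3.0, 5.0, 2.0]
metric_score = [0.62, 0.80, 0.71, 0.55, 0.75]

rho = spearman(professional_creativity, metric_score)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near 1 would mean the metric ranks segments the way the professionals do, a value near 0 would mean the two rankings are unrelated, and a negative value would mean the metric tends to rank highly the segments that professionals rated low.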
