A Dataset for Probing Translationese Preferences in English-to-Swedish Translation
Jenny Kunz, Anja Jarochenko, Marcel Bollmann

TL;DR
This paper introduces a new dataset for analyzing translationese in English-to-Swedish translation. Experiments with it show that language models tend to prefer literal, translationese phrasing over idiomatic alternatives, especially when the English source sentence is shown to the model.
Contribution
The paper presents the first freely available dataset contrasting translationese with idiomatic translations for English-to-Swedish, along with analysis of model preferences and biases.
Findings
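The smaller Swedish and multilingual LLMs evaluated often favor the translationese phrasing.
Human alternatives are chosen more often when the English source sentence is omitted, although even then models often prefer the translationese variant.
Exposure to the English source biases models toward literal translations.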
Abstract
Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. It includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations, although even without context models often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English…
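The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that each dataset item has already been scored by a model, giving one log-probability for the translationese sentence and one for the idiomatic alternative, once with the English source in the prompt and once without. The scoring method, the names ScoredPair and translationese_preference_rate, and all numbers below are hypothetical; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredPair:
    """One dataset item with hypothetical model log-probabilities."""
    logp_translationese: float  # score of the original, translationese sentence
    logp_idiomatic: float       # score of the human-written idiomatic alternative


def translationese_preference_rate(pairs: Sequence[ScoredPair]) -> float:
    """Fraction of pairs where the translationese variant scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one scored pair")
    wins = sum(p.logp_translationese > p.logp_idiomatic for p in pairs)
    return wins / len(pairs)


# Made-up scores for illustration only; these are not results from the paper.
with_source = [
    ScoredPair(-10.0, -12.0),
    ScoredPair(-8.0, -9.5),
    ScoredPair(-11.0, -10.0),
    ScoredPair(-7.0, -7.5),
]
without_source = [
    ScoredPair(-10.0, -11.0),
    ScoredPair(-9.0, -8.5),
    ScoredPair(-11.0, -10.0),
    ScoredPair(-7.0, -7.5),
]

rate_with = translationese_preference_rate(with_source)
rate_without = translationese_preference_rate(without_source)
print(f"with source: {rate_with:.2f}, without source: {rate_without:.2f}")
```

In this toy setup, a higher rate with the source than without it would correspond to the source-induced bias toward literal translations that the abstract reports.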
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Text Readability and Simplification
