When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
Lydia Nishimwe, Benoît Sagot, Rachel Bawden

TL;DR
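Evaluating UGC translation is hard because the source text is non-standard and what counts as a "good" translation depends on how standard the output is expected to be. Fair assessment of model performance therefore requires guideline-aware metrics and explicit translation guidelines.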
Contribution
This paper analyzes the human translation guidelines of four UGC datasets, derives a taxonomy of twelve non-standard phenomena and five translation actions, and argues for guideline-aware evaluation frameworks.
Findings
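Translation scores of large language models are highly sensitive to prompts that contain explicit UGC translation instructions.
Scores improve when the prompt instructions align with the dataset's translation guidelines.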
Guideline-aware evaluation is essential for fair UGC translation assessment.
Abstract
User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation challenging: what counts as a "good" translation depends on the desired standardness level of the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. We show that translation scores of large language models are highly sensitive to prompts with explicit UGC translation instructions, and that they improve when they align with the dataset guidelines. We argue that fair evaluation…
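As a rough illustration of what a prompt with explicit UGC translation instructions could look like, the sketch below encodes the five translation actions named in the abstract and turns a per-phenomenon guideline into prompt text. Everything beyond the five action names is an assumption made for illustration: the one-line glosses of each action, the `build_prompt` helper, the prompt wording, and the example guideline are hypothetical and are not taken from the paper. The phenomena used ("spelling errors", "emojis") are examples mentioned in the abstract, not the full taxonomy of twelve.

```python
from enum import Enum


class Action(Enum):
    """The five translation actions named in the abstract."""
    NORMALISE = "NORMALISE"
    COPY = "COPY"
    TRANSFER = "TRANSFER"
    OMIT = "OMIT"
    CENSOR = "CENSOR"


# Assumed one-line glosses of each action (hypothetical wording;
# the paper's own definitions may differ).
INSTRUCTION = {
    Action.NORMALISE: "Rewrite {p} in standard language.",
    Action.COPY: "Copy {p} unchanged into the translation.",
    Action.TRANSFER: "Reproduce {p} with an equivalent non-standard form.",
    Action.OMIT: "Leave {p} out of the translation.",
    Action.CENSOR: "Mask {p} in the translation.",
}


def build_prompt(source: str, target_lang: str, guidelines: dict) -> str:
    """Build a translation prompt listing one instruction per phenomenon.

    `guidelines` maps a phenomenon name (str) to the Action a dataset's
    guidelines prescribe for it.
    """
    lines = [f"Translate the following text into {target_lang}."]
    for phenomenon, action in guidelines.items():
        lines.append("- " + INSTRUCTION[action].format(p=phenomenon))
    lines.append(f"Text: {source}")
    return "\n".join(lines)


# Hypothetical guideline for an imaginary dataset.
example_guidelines = {
    "spelling errors": Action.NORMALISE,
    "emojis": Action.COPY,
}
print(build_prompt("omg sooo goood 😂", "French", example_guidelines))
```

Keeping the guideline as data rather than as free-form prompt text makes it straightforward to swap in a different dataset's conventions and compare scores under aligned and misaligned instructions.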
