Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

Finn Schmidt; Jan Philip Wahle; Terry Ruas; Bela Gipp

arXiv:2604.17393·cs.CL·April 22, 2026

TL;DR

This study tests how robust automatic translation evaluation metrics are on unseen domains. Metrics look robust at the segment level, but they fall short of human inter-annotator agreement once variation between annotators is taken into account, which argues for reporting human agreement as a benchmark alongside metric scores.

Contribution

The paper introduces CD-ESA, a multi-annotator Cross-Domain Error-Span-Annotation dataset with 18.8k human error span annotations across three language pairs, and uses it to show that metrics agree with humans less than humans agree with each other on unseen domains.

Findings

01

Metrics appear robust at segment level but are less reliable when human variation is considered.

02

Averaging annotations improves inter-annotator agreement by up to 0.11.

03

Metrics perform worse than humans on unseen chemical domains, with lower agreement scores.
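The effect of averaging annotations on agreement can be illustrated with a small sketch. It is not the paper's evaluation code: the agreement statistic the authors use is not stated on this page, so Pearson correlation is an assumption here, and the scores are synthetic (a hypothetical latent quality per segment plus independent annotator noise). The function names `mean_pairwise_agreement` and `mean_held_out_agreement` are made up for illustration. The sketch compares agreement between single annotators with agreement between one held-out annotator and the average of the others.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_scores(rows):
    """Average several annotators' scores segment by segment."""
    return [sum(col) / len(col) for col in zip(*rows)]


def mean_pairwise_agreement(scores):
    """Average correlation over all pairs of single annotators."""
    pairs = list(combinations(scores, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def mean_held_out_agreement(scores):
    """Average correlation between each annotator and the mean of the rest."""
    total = 0.0
    for i, held_out in enumerate(scores):
        others = scores[:i] + scores[i + 1:]
        total += pearson(held_out, mean_scores(others))
    return total / len(scores)


# Synthetic data (assumption): 500 segments, 4 annotators, Gaussian noise.
rng = random.Random(0)
latent_quality = [rng.uniform(0, 100) for _ in range(500)]
annotators = [
    [q + rng.gauss(0, 20) for q in latent_quality] for _ in range(4)
]

single = mean_pairwise_agreement(annotators)
averaged = mean_held_out_agreement(annotators)
print(f"single vs single:   {single:.3f}")
print(f"single vs averaged: {averaged:.3f}")
```

Under these assumptions, averaging cancels part of the independent annotator noise, so the averaged reference correlates more strongly with a held-out annotator than a single annotator does. The size of the gap depends on the noise level chosen here and should not be read as a reproduction of the reported 0.11.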

Abstract

Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear…

Code & Models

Datasets

FinnSchmidt/CD-ESA
dataset · 89 downloads
