With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector
Ondřej Plátek, Mateusz Lango, Ondřej Dušek

TL;DR
This paper attempts to reproduce a human evaluation study of an MT error detector, confirming its main conclusions but highlighting variability in human annotations and reproducibility challenges.
Contribution
It provides a detailed reproduction of a previous human evaluation experiment and discusses reproducibility issues and variability in human annotations.
Findings
Replicated results generally confirm original conclusions
Identified high variability in human annotation
Highlighted reproducibility challenges in human evaluation
Abstract
This work presents our efforts to reproduce the human evaluation experiment of Vamvas and Sennrich (2022), which evaluated an automatic system for detecting over- and undertranslations (translations containing more or less information than the original) in machine translation (MT) outputs. Despite the high quality of the documentation and code provided by the authors, we discuss some problems we encountered in reproducing the exact experimental setup and offer recommendations for improving reproducibility. Our replicated results generally confirm the conclusions of the original study, but in some cases we observed statistically significant differences, suggesting high variability in human annotation.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
