Toward More Effective Human Evaluation for Machine Translation
Belén Saldías, George Foster, Markus Freitag, Qijun Tan

TL;DR
This paper proposes a stratified sampling method leveraging document membership and automatic metrics to reduce human annotation costs while maintaining accurate evaluation of machine translation quality.
Contribution
It introduces a simple, effective sampling approach that improves evaluation accuracy and reduces costs in human assessments of machine translation.
Findings
Up to 20% reduction in average absolute error
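Stratified sampling and control variates improve estimates over a pure random sampling baseline
Applicable to any evaluation problem with similar structure (segments grouped into documents, with automatic metric scores available)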
Abstract
Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. We achieve gains of up to 20% in average absolute error by leveraging stratified sampling and control variates. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study.
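The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the proportional allocation of the budget across documents, and the single-metric regression coefficient are all assumptions made for illustration. The sketch assumes each segment has a document ID and an automatic metric score, and that human scores are collected only for the sampled segments.

```python
import random
from statistics import mean


def proportional_stratified_sample(doc_ids, budget, rng):
    """Pick roughly `budget` segment indices, treating each document as a stratum.

    Assumption: budget is allocated in proportion to document size,
    with at least one segment drawn from every document.
    """
    by_doc = {}
    for idx, doc in enumerate(doc_ids):
        by_doc.setdefault(doc, []).append(idx)
    total = len(doc_ids)
    chosen = []
    for idxs in by_doc.values():
        k = max(1, round(budget * len(idxs) / total))
        k = min(k, len(idxs))
        chosen.extend(rng.sample(idxs, k))
    return chosen


def stratified_estimate(doc_ids, sampled_idx, human_scores):
    """Estimate the test-set mean as a size-weighted mean of per-document sample means.

    `human_scores` maps a sampled segment index to its human score.
    Every document must have at least one sampled segment.
    """
    sizes = {}
    for doc in doc_ids:
        sizes[doc] = sizes.get(doc, 0) + 1
    sampled_by_doc = {}
    for idx in sampled_idx:
        sampled_by_doc.setdefault(doc_ids[idx], []).append(human_scores[idx])
    total = len(doc_ids)
    return sum(sizes[doc] / total * mean(vals) for doc, vals in sampled_by_doc.items())


def control_variate_estimate(human_sampled, metric_sampled, metric_all):
    """Correct the sample mean of human scores using an automatic metric.

    The metric's mean over the full test set is known, so the gap between
    its sample mean and its full mean is used to adjust the human estimate.
    """
    h_bar = mean(human_sampled)
    m_bar = mean(metric_sampled)
    var_m = sum((m - m_bar) ** 2 for m in metric_sampled)
    if var_m == 0:
        return h_bar  # metric carries no information in this sample
    cov = sum((h - h_bar) * (m - m_bar) for h, m in zip(human_sampled, metric_sampled))
    beta = cov / var_m
    return h_bar - beta * (m_bar - mean(metric_all))


if __name__ == "__main__":
    rng = random.Random(0)
    doc_ids = ["doc_a"] * 6 + ["doc_b"] * 4          # hypothetical test set
    metric_all = [rng.random() for _ in doc_ids]     # stand-in metric scores
    sample = proportional_stratified_sample(doc_ids, budget=5, rng=rng)
    # Stand-in "human" scores for sampled segments only.
    human = {i: metric_all[i] + rng.gauss(0, 0.1) for i in sample}
    print(stratified_estimate(doc_ids, sample, human))
    print(control_variate_estimate([human[i] for i in sample],
                                   [metric_all[i] for i in sample],
                                   metric_all))
```

The control variate correction helps most when the automatic metric correlates well with human scores; when the metric is uninformative, the coefficient is near zero and the estimate falls back toward the plain sample mean.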
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
