Toward More Effective Human Evaluation for Machine Translation

Belén Saldías; George Foster; Markus Freitag; Qijun Tan

arXiv:2204.05307 · cs.CL · April 12, 2022

TL;DR

This paper proposes a stratified sampling method leveraging document membership and automatic metrics to reduce human annotation costs while maintaining accurate evaluation of machine translation quality.

Contribution

It introduces a simple sampling approach that yields more accurate test-set score estimates from a fixed human annotation budget, reducing the cost of human evaluation of machine translation.

Findings

01. Up to 20% reduction in average absolute error compared with a pure random sampling baseline
02. Improved estimates from stratified sampling and control variates
03. Applicable to any evaluation problem with a similar structure
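The stratified-sampling idea behind these findings can be illustrated with a minimal Python sketch. It assumes each segment carries a stratum key (hypothetically, its document ID), allocates the annotation budget proportionally to stratum size with at least one sample per stratum, and combines per-stratum sample means weighted by each stratum's share of the test set. The function names `stratified_estimate` and `random_estimate` and the allocation rule are illustrative assumptions, not the paper's exact procedure.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence


def stratified_estimate(
    scores: Sequence[float],
    strata: Sequence[Hashable],
    budget: int,
    seed: int = 0,
) -> float:
    """Estimate mean(scores) from a stratified sample of about `budget` items.

    Assumption (illustrative): strata[i] is a grouping key for segment i,
    e.g. its document ID. In a real study only sampled segments would be
    annotated; here all scores are passed in to simulate that process.
    """
    rng = random.Random(seed)
    groups: dict = defaultdict(list)
    for score, key in zip(scores, strata):
        groups[key].append(score)

    n_total = len(scores)
    estimate = 0.0
    for members in groups.values():
        weight = len(members) / n_total  # stratum's share of the test set
        # Proportional allocation, with at least one item per stratum.
        n_h = min(len(members), max(1, round(budget * weight)))
        sample = rng.sample(members, n_h)
        estimate += weight * sum(sample) / n_h  # weighted stratum mean
    return estimate


def random_estimate(scores: Sequence[float], budget: int, seed: int = 0) -> float:
    """Baseline: mean of a simple random sample drawn without replacement."""
    sample = random.Random(seed).sample(list(scores), budget)
    return sum(sample) / budget


if __name__ == "__main__":
    scores = [1, 1, 1, 5, 5, 5]
    docs = ["d1", "d1", "d1", "d2", "d2", "d2"]
    print(stratified_estimate(scores, docs, budget=2))  # 3.0
```

When scores vary little within a stratum, the weighted estimate stays close to the true mean even at small budgets, which is where stratification helps relative to pure random sampling. Because of rounding and the one-per-stratum floor, the realized sample size can differ slightly from `budget`.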

Abstract

Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. We achieve gains of up to 20% in average absolute error by leveraging stratified sampling and control variates. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study.
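The control-variate idea mentioned in the abstract can be sketched as follows, assuming an automatic metric score is available for every segment while human scores exist only for the sampled ones. The sketch corrects the sample mean of human scores by how far the sampled metric mean deviates from the known full-test-set metric mean, with the coefficient fit on the sample (a regression-style estimator, which carries a small bias that shrinks as the sample grows). The function `control_variate_estimate` is a hypothetical name, and the paper's exact estimator may differ.

```python
from statistics import mean
from typing import Sequence


def control_variate_estimate(
    human_sample: Sequence[float],
    metric_all: Sequence[float],
    sample_idx: Sequence[int],
) -> float:
    """Estimate the full-test-set mean human score, using an automatic
    metric as a control variate (illustrative sketch).

    human_sample[k] is the human score of segment sample_idx[k];
    metric_all[i] is the metric score of segment i, known for every segment.
    Needs at least two sampled segments.
    """
    x = [metric_all[i] for i in sample_idx]  # metric scores on the sample
    y = list(human_sample)                   # human scores on the sample
    mu_x = mean(metric_all)                  # known test-set metric mean
    x_bar, y_bar = mean(x), mean(y)

    # Coefficient c = cov(x, y) / var(x), fit on the sample; the shared
    # 1/(n-1) factors cancel, so plain sums of deviations suffice.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        return y_bar  # metric is constant on the sample: no correction
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    c = sxy / sxx

    # Shift the sample mean by how far the sampled metric mean is off.
    return y_bar - c * (x_bar - mu_x)


if __name__ == "__main__":
    metric = [1, 2, 3, 4, 5, 6, 7, 8]
    human = [2 * m + 1 for m in metric]  # toy data: perfectly correlated
    idx = [0, 1, 2]
    print(control_variate_estimate([human[i] for i in idx], metric, idx))  # 10.0
```

In this toy case the plain sample mean of human scores is 5, while the corrected estimate recovers the true test-set mean of 10; with real, imperfectly correlated metrics the gain depends on how strongly metric and human scores agree.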

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems