Difficulty-Aware Machine Translation Evaluation
Runzhe Zhan, Xuebo Liu, Derek F. Wong, Lidia S. Chao

TL;DR
This paper introduces a difficulty-aware evaluation metric for machine translation. It weights each translation by its difficulty, estimated from how many MT systems fail to predict it, which improves correlation with human judgment, especially among competitive systems.
Contribution
It proposes a novel MT evaluation metric that incorporates translation difficulty, addressing limitations of existing metrics that treat all sentences equally.
Findings
Outperforms commonly used MT metrics in correlation with human judgments on the WMT19 English-German Metrics shared task.
Effectively distinguishes between highly competitive MT systems.
Maintains robustness even when systems are very similar.
Abstract
The high-quality translation results produced by machine translation (MT) systems still pose a huge challenge for automatic evaluation. Current MT evaluation pays the same attention to each sentence component, while the questions of real-world examinations (e.g., university examinations) have different difficulties and weightings. In this paper, we propose a novel difficulty-aware MT evaluation metric, expanding the evaluation dimension by taking translation difficulty into consideration. A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function, and conversely. Experimental results on the WMT19 English-German Metrics shared tasks show that our proposed method outperforms commonly used MT metrics in terms of human correlation. In particular, our proposed method performs well even when all the MT…
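The weighting idea in the abstract can be illustrated with a minimal sketch. This is a simplified, sentence-level illustration and not the authors' exact formulation: it assumes a base metric has already produced per-item scores in [0, 1] for every system, and it assumes difficulty can be approximated as one minus the average score across systems. The names `difficulty_weights`, `difficulty_aware_score`, `sys_a`, and `sys_b`, and all numbers, are hypothetical.

```python
from typing import Dict, List


def difficulty_weights(scores: Dict[str, List[float]]) -> List[float]:
    """Assumed difficulty of each item: 1 - mean base-metric score across systems.

    Items that most systems score poorly on get a weight close to 1;
    items that most systems handle well get a weight close to 0.
    """
    per_system = list(scores.values())
    n_items = len(per_system[0])
    return [
        1.0 - sum(s[j] for s in per_system) / len(per_system)
        for j in range(n_items)
    ]


def difficulty_aware_score(system_scores: List[float], weights: List[float]) -> float:
    """Difficulty-weighted average of one system's per-item scores."""
    total = sum(weights)
    if total == 0:
        # Every item is trivially easy: fall back to the plain average.
        return sum(system_scores) / len(system_scores)
    return sum(w * s for w, s in zip(weights, system_scores)) / total


# Hypothetical per-item scores for two systems on three items.
scores = {
    "sys_a": [0.9, 0.8, 0.2],  # strong on easy items, weak on the hard one
    "sys_b": [0.9, 0.6, 0.4],  # weaker on a medium item, better on the hard one
}

weights = difficulty_weights(scores)
plain = {name: sum(s) / len(s) for name, s in scores.items()}
weighted = {name: difficulty_aware_score(s, weights) for name, s in scores.items()}

print("weights: ", weights)
print("plain:   ", plain)
print("weighted:", weighted)
```

In this toy data both systems have the same plain average, so unweighted scoring cannot separate them, while the difficulty-weighted score favors the system that does better on the item most systems get wrong. That is the kind of separation between closely matched systems the abstract describes.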
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
