Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress

Lorenzo Proietti; Stefano Perrella; Roberto Navigli

arXiv:2506.19571 · cs.CL · June 25, 2025

TL;DR

This paper asks whether automatic MT evaluation metrics have reached human-level performance by adding human baselines to the meta-evaluation. State-of-the-art metrics often rank on par with or above those baselines, but the authors give several reasons to interpret this apparent parity with caution.

Contribution

The study introduces human baselines into MT meta-evaluation, intended as an upper bound on metric performance, and critically analyzes what it means for metrics to reach human parity.

Findings

01 State-of-the-art metrics often rank on par with or higher than human baselines.

02 Human annotators are not consistently superior to automatic metrics.

03 The results raise the question of whether improvements in MT evaluation can still be measured reliably.

Abstract

In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to…
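
The idea of placing a human baseline alongside automatic metrics in a meta-evaluation can be illustrated with a minimal sketch. It assumes pairwise ranking agreement as the agreement statistic, which may differ from the measures the paper actually uses, and every score below is hypothetical illustrative data rather than a result from the paper. The function name `pairwise_agreement` and the evaluator names are likewise made up for this example.

```python
from itertools import combinations


def pairwise_agreement(candidate, reference):
    """Fraction of item pairs that `candidate` orders the same way as `reference`.

    Pairs tied in `reference` are skipped; a tie in `candidate` counts as a
    disagreement. This is an assumed, simplified agreement statistic.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue  # no preference in the reference, nothing to agree with
        total += 1
        cand_diff = candidate[i] - candidate[j]
        if cand_diff * ref_diff > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical quality scores for five translations (higher is better).
reference_human = [90, 70, 50, 30, 10]   # judgments treated as ground truth
second_human = [85, 60, 65, 20, 25]      # human baseline: another annotator
metric = [0.9, 0.8, 0.6, 0.2, 0.3]       # automatic metric

# Rank the human baseline and the metric with the same procedure.
evaluators = {"second_human": second_human, "metric": metric}
scores = {
    name: pairwise_agreement(values, reference_human)
    for name, values in evaluators.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

In this toy setup the metric agrees with the reference annotator on more pairs than the second annotator does, which is the kind of outcome the abstract describes as metrics ranking on par with or higher than human baselines. With so few items such a gap carries no statistical weight.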

Code & Models

Repositories

sapienzanlp/human-parity-mt-eval (Official)

Videos

Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress· underline

Taxonomy

Topics: Natural Language Processing Techniques