Sentence-level Aggregation of Lexical Metrics Correlates Stronger with Human Judgements than Corpus-level Aggregation

Paulo Cavalin; Pedro Henrique Domingues; Claudio Pinhanez

arXiv:2407.12832·cs.CL·January 24, 2025

TL;DR

This paper demonstrates that aggregating lexical metrics at the sentence level rather than the corpus level significantly improves their correlation with human judgments and aligns them more closely with neural metrics, especially for low-resource languages.

Contribution

It reveals that sentence-level aggregation enhances lexical metric performance and robustness, offering a better evaluation approach for machine translation systems, particularly in low-resource scenarios.

Findings

01. Sentence-level aggregation improves correlation with human judgments.

02. Averaging segment scores aligns lexical metrics with neural metrics.

03. Corpus-level aggregation reduces statistical robustness.

Abstract

In this paper we show that corpus-level aggregation considerably hinders the ability of lexical metrics to accurately evaluate machine translation (MT) systems. Through empirical experiments, we demonstrate that averaging individual segment-level scores makes metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more like neural metrics such as COMET and BLEURT. We show that this difference arises because corpus- and segment-level aggregation differ considerably, owing to the classical "average of ratios versus ratio of averages" mathematical problem. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Considering that neural metrics currently only cover a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of…
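
The "average of ratios versus ratio of averages" distinction mentioned in the abstract can be illustrated with a minimal sketch. The code below uses a toy clipped unigram precision as a stand-in for a lexical metric; it is not BLEU or chrF, and the function names (`unigram_matches`, `corpus_level`, `sentence_level`) and example sentences are hypothetical, chosen only for illustration. It assumes every hypothesis contains at least one token.

```python
from collections import Counter


def unigram_matches(hypothesis, reference):
    """Return (clipped matching tokens, hypothesis length) for one segment.

    Toy stand-in for a lexical metric's per-segment statistics; NOT BLEU or chrF.
    """
    hyp_tokens = hypothesis.split()
    hyp_counts = Counter(hyp_tokens)
    ref_counts = Counter(reference.split())
    # A hypothesis token is credited at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    return matches, len(hyp_tokens)


def corpus_level(pairs):
    """Ratio of sums: pool the counts of all segments, then divide once."""
    stats = [unigram_matches(hyp, ref) for hyp, ref in pairs]
    return sum(m for m, _ in stats) / sum(n for _, n in stats)


def sentence_level(pairs):
    """Average of ratios: score each segment separately, then take the mean."""
    stats = [unigram_matches(hyp, ref) for hyp, ref in pairs]
    return sum(m / n for m, n in stats) / len(stats)


# Hypothetical (hypothesis, reference) pairs: one short wrong segment,
# one long correct segment.
pairs = [
    ("no", "yes"),
    ("the cat sat on the mat in the sun", "the cat sat on the mat in the sun"),
]

print(corpus_level(pairs))    # 0.9  (9 matches / 10 hypothesis tokens)
print(sentence_level(pairs))  # 0.5  (mean of 0.0 and 1.0)
```

In this toy case the pooled score is dominated by the long segment, while the averaged score weights both segments equally. That is the sense in which the two aggregation schemes can disagree on the same per-segment statistics.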

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training