Span-Level Machine Translation Meta-Evaluation
Stefano Perrella, Eric Morales Agostinho, Hugo Zaragoza

TL;DR
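This paper compares different implementations of span-level precision, recall, and F-score for evaluating MT error detection, shows where they fall short, and proposes "match with partial overlap and partial credit" (MPP) with micro-averaging as a robust meta-evaluation strategy, which it then uses to assess current MT error detection techniques.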
Contribution
The paper introduces MPP, a new meta-evaluation strategy for MT error detection, and shows how the choice of span-level precision, recall, and F-score implementation affects the evaluation of error detectors.
Findings
Seemingly similar implementations of span-level precision, recall, and F-score can produce substantially different rankings of error detectors.
Certain widely-used techniques are unsuitable for evaluating MT error detection.
MPP with micro-averaging is proposed as a robust meta-evaluation strategy.
Abstract
Machine Translation (MT) and automatic MT evaluation have improved dramatically in recent years, enabling numerous novel applications. Automatic evaluation techniques have evolved from producing scalar quality scores to precisely locating translation errors and assigning them error categories and severity levels. However, it remains unclear how to reliably measure the evaluation capabilities of auto-evaluators that do error detection, as no established technique exists in the literature. This work investigates different implementations of span-level precision, recall, and F-score, showing that seemingly similar approaches can yield substantially different rankings, and that certain widely-used techniques are unsuitable for evaluating MT error detection. We propose "match with partial overlap and partial credit" (MPP) with micro-averaging as a robust meta-evaluation strategy and release…
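To make the idea of partial overlap with partial credit concrete, here is a minimal sketch under stated assumptions. It is not the paper's exact definition of MPP: the abstract does not give the formulas, so the credit rule used here (the fraction of a span's characters covered by its best-overlapping counterpart) and the names `overlap`, `credit`, and `micro_prf` are illustrative choices. Spans are assumed to be `(start, end)` character offsets with an exclusive end, and micro-averaging is assumed to mean pooling credit and span counts over all segments before dividing.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end-exclusive (assumption)


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two spans (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def credit(span: Span, others: List[Span]) -> float:
    """Partial credit for `span`: the fraction of its length covered by the
    single best-overlapping span in `others` (hypothetical credit rule)."""
    length = span[1] - span[0]
    if length <= 0 or not others:
        return 0.0
    return max(overlap(span, o) for o in others) / length


def micro_prf(hyps: List[List[Span]], golds: List[List[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all segments.

    Credit and span counts are summed across segments first, then divided,
    so segments with many spans weigh more than segments with few.
    """
    p_credit = r_credit = 0.0
    n_hyp = n_gold = 0
    for hyp, gold in zip(hyps, golds):
        p_credit += sum(credit(s, gold) for s in hyp)  # how well predictions hit gold
        r_credit += sum(credit(s, hyp) for s in gold)  # how well gold is recovered
        n_hyp += len(hyp)
        n_gold += len(gold)
    precision = p_credit / n_hyp if n_hyp else 0.0
    recall = r_credit / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two segments: a half-overlapping prediction, then a missed gold error.
hyps = [[(0, 4)], []]
golds = [[(2, 6)], [(0, 2)]]
print(micro_prf(hyps, golds))
```

In this toy input an exact-match criterion would give the detector no credit at all, while the partial-credit rule rewards the half-overlapping prediction in the first segment and still penalizes the missed error in the second.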
Taxonomy
Topics: Natural Language Processing Techniques · Software Testing and Debugging Techniques · Topic Modeling
