TL;DR
This paper introduces MetricX-25, a quality score prediction metric, and GemSpanEval, an error span detection model. Both are fine-tuned from the multilingual open-weights model Gemma 3 on publicly available WMT data and were submitted to the WMT25 Translation Evaluation Shared Task.
Contribution
The paper presents MetricX-25, a new generation of MetricX with an improved input format and training protocol for quality score prediction, and GemSpanEval, a new decoder-only model that predicts error spans together with their severities and categories.
Findings
MetricX-25 predicts both MQM and ESA quality scores and significantly outperforms its predecessor.
GemSpanEval is competitive in error span detection with xCOMET, a strong encoder-only baseline.
Framing error span detection as a generative task allows error spans to be identified unambiguously.
Abstract
In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only…
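To illustrate why a generative formulation needs care to identify spans unambiguously, here is a minimal Python sketch that maps generated span records back to character offsets in the translation. The JSON record layout (`span`, `context`, `severity`, `category`), the `ErrorSpan` class, and the `locate_spans` helper are all hypothetical and chosen for illustration; they are not GemSpanEval's actual output format or code.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorSpan:
    start: int      # character offset where the error begins
    end: int        # character offset one past the end of the error
    severity: str
    category: str


def locate_spans(translation: str, model_output: str) -> List[ErrorSpan]:
    """Resolve generated span records to offsets in the translation.

    Assumption: model_output is a JSON list of records with hypothetical
    fields "span", "context", "severity" and "category".
    """
    spans = []
    for record in json.loads(model_output):
        text = record["span"]
        # Without a context, the span text itself must be unique.
        context = record.get("context", text)
        # Skip records whose context is missing or occurs more than once.
        if translation.count(context) != 1:
            continue
        offset = context.find(text)  # first occurrence inside the context
        if offset == -1:
            continue
        start = translation.find(context) + offset
        spans.append(
            ErrorSpan(start, start + len(text),
                      record["severity"], record["category"])
        )
    return spans
```

In the sentence "the cat sat on the mat", a bare span "the" matches two positions and cannot be resolved, whereas the same span with the context "on the mat" resolves to a single position.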