MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task

Juraj Juraska; Tobias Domhan; Mara Finkelstein; Tetsuji Nakagawa; Geza Kovacs; Daniel Deutsch; Pidong Wang; Markus Freitag

arXiv:2510.24707·cs.CL·October 29, 2025

TL;DR

This paper introduces MetricX-25, a quality score prediction model, and GemSpanEval, an error span detection model, as submissions to the WMT25 Translation Evaluation Shared Task. Both are fine-tuned from the multilingual open-weights model Gemma 3 on publicly available WMT data.

Contribution

The paper presents two models: MetricX-25, which adapts Gemma 3 to an encoder-only architecture with a regression head to predict MQM and ESA quality scores, and GemSpanEval, a decoder-only model that predicts error spans together with their severities and categories.

Findings

01. MetricX-25 significantly outperforms its predecessor in quality score prediction.

02. GemSpanEval is competitive with xCOMET in error span detection.

03. Framing error span detection as a generative task helps identify error spans unambiguously.

Abstract

In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only…
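To make the two kinds of output mentioned in the abstract concrete, here is a minimal sketch of how predicted error spans (each with a severity and a category) could be represented and collapsed into a single MQM-style penalty. The `ErrorSpan` class, the helper names, the example sentence, and the severity weights (minor = 1, major = 5, a common MQM convention) are assumptions for illustration; they are not taken from the paper and do not reflect GemSpanEval's actual output format.

```python
from dataclasses import dataclass

# Assumed weights, following a common MQM convention (minor = 1, major = 5).
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}


@dataclass(frozen=True)
class ErrorSpan:
    """Hypothetical container for one predicted error span."""
    start: int      # character offset into the translation (inclusive)
    end: int        # character offset into the translation (exclusive)
    severity: str   # "minor" or "major"
    category: str   # e.g. "accuracy/mistranslation"


def mqm_style_score(spans):
    """Sum the severity weights of all spans; 0 means no errors were found."""
    return sum(SEVERITY_WEIGHTS[s.severity] for s in spans)


def span_text(translation, span):
    """Return the substring of the translation covered by a span."""
    return translation[span.start:span.end]


# Made-up example: two spans marked on a short translation.
translation = "The cat sat in the mat."
spans = [
    ErrorSpan(12, 14, "minor", "fluency/grammar"),
    ErrorSpan(4, 7, "major", "accuracy/mistranslation"),
]

print(mqm_style_score(spans))
print([span_text(translation, s) for s in spans])
```

Under these assumptions a lower penalty means a better translation, so a span-level prediction can be reduced to a segment-level score of the kind the Quality Score Prediction subtask targets.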
