WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects

Daniel Deutsch; Eleftheria Briakou; Isaac Caswell; Mara Finkelstein; Rebecca Galor; Juraj Juraska; Geza Kovacs; Alison Lui; Ricardo Rei; Jason Riesa; Shruti Rijhwani; Parker Riley; Elizabeth Salesky; Firas Trabelsi; Stephanie Winkler; Biao Zhang; Markus Freitag

arXiv:2502.12404·cs.CL·February 19, 2025


TL;DR

This paper expands the WMT24 dataset to include 55 languages and dialects, providing a comprehensive benchmark for evaluating multilingual machine translation performance of various models, including large language models.

Contribution

The work extends the WMT24 dataset to 55 languages with new references and post-edits, enabling broader evaluation of multilingual translation models.

Findings

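01. According to automatic metrics, LLMs are the best-performing MT systems in all 55 languages

02. The expanded dataset covers four domains: literary, news, social, and speech

03. The automatic-metric results still need to be confirmed by human evaluation, which is left for future work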

Abstract

As large language models (LLMs) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. The dataset covers four domains: literary, news, social, and speech. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. These results should be confirmed using a human-based evaluation, which we leave for future work.
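The claim that LLMs are "the best-performing MT systems in all 55 languages" amounts to a per-language comparison: for each language, take the system with the best automatic-metric score and check whether it is an LLM. A minimal sketch of that check follows. The system names, language codes, and scores are hypothetical placeholders, not results from the paper, and the sketch assumes a metric where higher is better.

```python
# Placeholder scores (higher is better). These are NOT results from the paper;
# language codes and system names are hypothetical.
SCORES = {
    ("en-de", "llm_a"): 0.81,
    ("en-de", "mt_provider_x"): 0.78,
    ("en-ja", "llm_a"): 0.74,
    ("en-ja", "mt_provider_x"): 0.76,
}

# Hypothetical mapping from system name to system category.
SYSTEM_TYPE = {"llm_a": "LLM", "mt_provider_x": "MT provider"}


def best_system_per_language(scores):
    """Return {language: (system, score)} for the top-scoring system."""
    best = {}
    for (lang, system), score in scores.items():
        if lang not in best or score > best[lang][1]:
            best[lang] = (system, score)
    return best


def llm_wins_everywhere(scores, system_type):
    """True if the top-scoring system in every language is an LLM."""
    best = best_system_per_language(scores)
    return all(system_type[system] == "LLM" for system, _ in best.values())


if __name__ == "__main__":
    for lang, (system, score) in sorted(best_system_per_language(SCORES).items()):
        print(f"{lang}: {system} ({score:.2f})")
    print("LLM best in every language:", llm_wins_everywhere(SCORES, SYSTEM_TYPE))
```

With the placeholder data above, one language is won by the non-LLM system, so the check returns `False`. The paper's finding corresponds to this check holding across all 55 languages under its automatic metrics.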

Code & Models

Repositories

google-research/mt-metrics-eval

Taxonomy

Topics: Natural Language Processing Techniques