Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains

Vilém Zouhar; Shuoyang Ding; Anna Currey; Tatyana Badeka; Jenyuan Wang; Brian Thompson

arXiv:2402.18747·cs.CL·June 5, 2024·1 cites

Open Access · 1 Repo · 1 Dataset · 1 Video

TL;DR

This paper tests whether machine translation metrics fine-tuned on human quality judgments stay robust when applied to a domain unseen in training (biomedical). It finds that they lose substantially more performance than surface-form metrics and pre-trained metrics that are not fine-tuned.

Contribution

It introduces an MQM-annotated biomedical MT quality dataset covering 11 language pairs and uses it to compare the domain robustness of fine-tuned, pre-trained, and surface-form MT metrics.

Findings

01 Fine-tuned metrics perform poorly on unseen domains.

02 Pre-trained and surface form-based metrics are more robust.

03 New dataset enables domain shift evaluation in MT quality assessment.

Abstract

We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics that are fine-tuned on human-generated MT quality judgments are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen-domain scenario relative to metrics that rely on the surface form, as well as pre-trained metrics that are not fine-tuned on MT quality judgments.
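
The comparison described in the abstract can be illustrated with a minimal sketch: score the same metric against human judgments in a seen and an unseen domain, then look at how much the correlation drops. This is not the authors' evaluation code. Kendall's tau is assumed here as the agreement measure, the helper names `kendall_tau` and `domain_gap` are hypothetical, and the scores are toy numbers invented for illustration, not results from the paper or the dataset.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count toward neither total in tau-a.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def domain_gap(metric_scores, human_scores):
    """Return per-domain correlations and the seen-minus-unseen drop."""
    taus = {
        domain: kendall_tau(metric_scores[domain], human_scores[domain])
        for domain in human_scores
    }
    return taus, taus["seen"] - taus["unseen"]


# Toy, made-up segment-level scores (assumption: higher means better quality).
human = {
    "seen": [0.9, 0.4, 0.7, 0.1, 0.6],
    "unseen": [0.8, 0.3, 0.5, 0.2, 0.9],
}
metric = {
    "seen": [0.85, 0.35, 0.75, 0.2, 0.5],
    "unseen": [0.6, 0.5, 0.4, 0.55, 0.7],
}

taus, drop = domain_gap(metric, human)
print(taus, drop)
```

A larger drop for one metric than for another on the same two domains would suggest the first is less robust to the domain shift, which is the kind of contrast the paper reports between fine-tuned metrics and the other two families.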

Code & Models

Repositories

amazon-science/bio-mqm-dataset
PyTorch · Official

Datasets

zouharvi/bio-mqm-dataset
dataset · 40 dl

Videos

Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains · Underline

Taxonomy

Topics: Natural Language Processing Techniques