Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
Vilém Zouhar, Shuoyang Ding, Anna Currey, Tatyana Badeka, Jenyuan Wang, Brian Thompson

TL;DR
This paper tests whether fine-tuned machine translation metrics stay robust when applied to an unseen domain (biomedical), and finds that they lose substantially more performance than surface-form metrics and pre-trained metrics that are not fine-tuned on MT quality judgments.
Contribution
It introduces an extensive MQM-annotated biomedical MT quality dataset covering 11 language pairs and uses it to systematically compare the domain robustness of different types of MT metrics.
Findings
Fine-tuned metrics perform poorly on unseen domains.
Pre-trained and surface form-based metrics are more robust.
New dataset enables domain shift evaluation in MT quality assessment.
Abstract
We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics which are fine-tuned on human-generated MT quality judgements are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen domain scenario relative to metrics that rely on the surface form, as well as pre-trained metrics which are not fine-tuned on MT quality judgments.
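One way to picture the comparison the abstract describes is to correlate a metric's scores with human quality scores separately in a seen and an unseen domain, then look at the gap. The following is only a minimal sketch under stated assumptions, not the authors' evaluation code: the helper name `kendall_tau`, the choice of Kendall tau-a as the agreement measure, and all numbers are hypothetical and made up for illustration, and human scores are assumed to be oriented so that higher means better.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        agreement = (x1 - x2) * (y1 - y2)
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
        # ties (agreement == 0) count toward neither group
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy, made-up scores (NOT from the paper); higher = better in both lists.
human_seen = [0.9, 0.7, 0.4, 0.2, 0.1]
metric_seen = [0.95, 0.6, 0.5, 0.3, 0.05]    # ranks segments like the humans do

human_unseen = [0.9, 0.7, 0.4, 0.2, 0.1]
metric_unseen = [0.6, 0.8, 0.2, 0.5, 0.1]    # two pairs ranked the wrong way

tau_seen = kendall_tau(metric_seen, human_seen)
tau_unseen = kendall_tau(metric_unseen, human_unseen)
drop = tau_seen - tau_unseen

print(f"seen-domain tau:   {tau_seen:.2f}")
print(f"unseen-domain tau: {tau_unseen:.2f}")
print(f"drop:              {drop:.2f}")
```

Running the same comparison for a fine-tuned metric, a surface-form metric, and a pre-trained metric would show which type loses the most agreement with human judgments when the domain changes.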
Taxonomy
Topics: Natural Language Processing Techniques
