Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Gabriele Sarti; Vilém Zouhar; Malvina Nissim; Arianna Bisazza

arXiv:2505.23183 · cs.CL · November 18, 2025

TL;DR

This paper evaluates unsupervised word-level quality estimation metrics for machine translation, derived from model interpretability and uncertainty quantification, against multiple sets of human labels. The results expose the limitations of supervised methods under label uncertainty and point to the untapped potential of unsupervised metrics.

Contribution

It investigates unsupervised approaches that use interpretability and uncertainty quantification to detect translation errors from the inner workings of the translation model, challenging the reliance on large labeled datasets.
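To make the idea of an uncertainty-based error signal concrete, here is a minimal sketch, assuming the translation model exposes a probability for each generated token. It flags tokens whose surprisal (negative log-probability) exceeds a threshold. The function name, the threshold, and the example probabilities are all hypothetical and chosen for illustration; this is one generic uncertainty signal, not necessarily one of the metrics evaluated in the paper.

```python
import math


def flag_uncertain_tokens(tokens, probs, threshold=2.0):
    """Flag tokens whose surprisal (-log p) exceeds `threshold`.

    `tokens` and `probs` are parallel lists: each probability is assumed
    to be the model's probability of the token it actually generated.
    The threshold value is an illustrative assumption.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    results = []
    for tok, p in zip(tokens, probs):
        surprisal = -math.log(p)  # high surprisal = low model confidence
        results.append((tok, surprisal, surprisal > threshold))
    return results


# Invented example: the model is confident everywhere except the last token.
tokens = ["The", "cat", "sat", "on", "the", "moon"]
probs = [0.9, 0.8, 0.7, 0.95, 0.9, 0.05]

flags = flag_uncertain_tokens(tokens, probs)
flagged_tokens = [tok for tok, _, is_flagged in flags if is_flagged]
print(flagged_tokens)
```

No labeled data is involved: the only inputs are the model's own output probabilities, which is what makes a signal of this kind unsupervised.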

Findings

01 Unsupervised metrics outperform supervised ones under label uncertainty

02 Multiple human annotations reveal variability affecting metric performance

03 Single-annotator evaluations are often brittle and unreliable
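The brittleness of single-annotator evaluation can be illustrated with a small sketch that scores one set of predicted error flags against several annotators' labels separately. Token-level F1 is assumed here as the score, and the annotator names and labels are invented; the paper's actual evaluation measure and data are not described in this summary.

```python
def f1(pred, gold):
    """Token-level F1 between two parallel lists of 0/1 error flags."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_per_annotator(pred, annotations):
    """Score the same predictions against each annotator's labels."""
    return {name: f1(pred, labels) for name, labels in annotations.items()}


# Invented example: one prediction, three annotators who disagree.
pred = [0, 0, 1, 0, 0, 1]
annotations = {
    "annotator_a": [0, 0, 1, 0, 0, 1],
    "annotator_b": [0, 0, 0, 0, 0, 1],
    "annotator_c": [0, 1, 0, 0, 0, 0],
}

scores = f1_per_annotator(pred, annotations)
spread = max(scores.values()) - min(scores.values())
for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
print(f"spread across annotators: {spread:.2f}")
```

The same predictions can look perfect or useless depending on which annotator serves as the reference, so reporting the spread across annotators gives a more honest picture than any single score.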

Abstract

Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and…

Code & Models

Repositories

gsarti/labl (official)

Datasets

gsarti/unsup_wqe_metrics
dataset · 30 downloads

Videos

Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement· underline