TL;DR
This paper formalizes the task of estimating translation difficulty, introduces new metrics and models for it, and demonstrates their utility in creating more challenging benchmarks for machine translation systems.
Contribution
The paper proposes a formal definition of translation difficulty, develops dedicated difficulty estimation models, and shows that these models help build more challenging, more discriminative evaluation benchmarks.
Findings
Dedicated difficulty models outperform heuristics and LLM-based judges.
Sentinel-src models achieve top performance in difficulty estimation.
Difficulty estimators can identify texts that challenge current translation systems.
Abstract
Machine translation quality has steadily improved over the years, achieving near-perfect translations in recent benchmarks. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. In this context, automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. In this work, we address this gap by formalizing the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging benchmarks for machine translation. Our…
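The abstract's definition, in which a text's difficulty follows from the expected quality of its translations, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the quality scores are invented, the negated mean is only one plausible way to turn expected quality into a difficulty value, and the names `difficulty` and `hardest_subset` are hypothetical. It is not the paper's actual formulation, metric, or model.

```python
from statistics import mean


def difficulty(quality_scores):
    """Difficulty of one source text, taken here as the negated mean
    quality of its translations (an illustrative assumption)."""
    if not quality_scores:
        raise ValueError("need at least one translation quality score")
    return -mean(quality_scores)


def hardest_subset(scores_by_text, k):
    """Return the k source texts with the highest difficulty, hardest first."""
    ranked = sorted(
        scores_by_text,
        key=lambda text: difficulty(scores_by_text[text]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical quality scores in [0, 1], one per translation system.
scores = {
    "src_a": [0.95, 0.90, 0.92],
    "src_b": [0.60, 0.55, 0.70],
    "src_c": [0.80, 0.85, 0.75],
}

hard = hardest_subset(scores, 2)
print(hard)
```

Under these assumptions, the texts whose translations score lowest on average are the ones kept for a harder benchmark. In practice a difficulty estimator would have to predict such values from the source text alone, since translations and their quality scores are not available when selecting new texts.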
