Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

Shaomu Tan; Ryosuke Mitani; Ritvik Choudhary; Qiyu Wu; Toshiyuki Sekiya; Christof Monz

arXiv:2512.18906·cs.CL·December 23, 2025

Open Access

TL;DR

Remedy-R is an interpretable generative metric for machine translation evaluation. It is trained with reinforcement learning on pairwise translation preferences rather than error annotations, and it reasons step by step about a translation before assigning a final score.

Contribution

The paper introduces Remedy-R, a reasoning-based MT evaluation metric trained without error-span annotations. The metric produces a step-by-step analysis of each translation, and that analysis can be fed back to a translation model in a feedback loop to improve translation quality.

Findings

01 Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation.

02 It generalizes to other languages and is robust on out-of-distribution (OOD) stress tests.

03 The Remedy-R Agent improves translation quality across various models.
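
The feedback loop behind the third finding can be pictured roughly as follows. This is only a minimal sketch of a generic evaluate-then-revise loop, not the paper's agent: the function name `refine_with_feedback`, the `evaluate` and `revise` callables, and the stopping rule (stop when the score no longer improves, or after `max_rounds`) are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   evaluate(source, translation) -> (score, analysis_text)
#   revise(source, translation, analysis_text) -> new_translation
Evaluate = Callable[[str, str], Tuple[float, str]]
Revise = Callable[[str, str, str], str]


def refine_with_feedback(
    source: str,
    draft: str,
    evaluate: Evaluate,
    revise: Revise,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Iteratively revise a translation using the metric's analysis as feedback.

    Keeps a revision only if the metric scores it strictly higher than the
    current best; otherwise stops early and returns the current best.
    """
    best = draft
    best_score, analysis = evaluate(source, best)
    for _ in range(max_rounds):
        candidate = revise(source, best, analysis)
        score, candidate_analysis = evaluate(source, candidate)
        if score <= best_score:
            break  # no improvement: keep the previous best translation
        best, best_score, analysis = candidate, score, candidate_analysis
    return best, best_score
```

In practice `evaluate` would wrap a reasoning metric that returns both a score and its written analysis, and `revise` would wrap a translation model prompted with that analysis; both are left abstract here because the page does not describe them.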

Abstract

Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests.…
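
To make "reinforcement learning from pairwise translation preferences" more concrete, here is a minimal sketch of one way a pairwise preference could be turned into a reward signal. It assumes a binary agreement reward: the metric scores both translations of the same source, and the reward is 1 when the predicted ordering matches the human preference. The function names, the `margin` parameter, and the binary form of the reward are illustrative assumptions; the abstract does not specify the reward the authors use.

```python
from typing import Iterable, Tuple


def pairwise_reward(
    score_a: float,
    score_b: float,
    human_prefers_a: bool,
    margin: float = 0.0,
) -> float:
    """Binary reward (an assumed form): 1.0 if the metric ranks the
    human-preferred translation higher by more than `margin`, else 0.0.
    Ties count as disagreement."""
    diff = score_a - score_b
    if not human_prefers_a:
        diff = -diff
    return 1.0 if diff > margin else 0.0


def pairwise_accuracy(pairs: Iterable[Tuple[float, float, bool]]) -> float:
    """Fraction of (score_a, score_b, human_prefers_a) triples where the
    metric's ordering agrees with the human preference."""
    rewards = [pairwise_reward(a, b, pref) for a, b, pref in pairs]
    if not rewards:
        raise ValueError("pairs must be non-empty")
    return sum(rewards) / len(rewards)
```

The same agreement check doubles as a simple pairwise-accuracy measure for comparing a metric against human preferences, which is why the two functions are shown together.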

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Artificial Intelligence in Healthcare and Education