RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

Dongyub Jude Lee; Zhenyi Ye; Pengcheng He

arXiv:2507.22219 · cs.CL · December 22, 2025

TL;DR

This paper introduces RLfR, a reinforcement learning approach for machine translation in which a frozen teacher minimally edits the actor's own on-policy drafts and the actor is rewarded for closing the gap to those refinements, improving semantic quality without relying on preference triplets.

Contribution

RLfR replaces static preference triplets with on-policy teacher-guided refinements, enabling stable, model-aware learning for MT without explicit preference datasets.

Findings

01. RLfR outperforms baseline methods on FLORES-200 across multiple languages.

02. RLfR improves semantic quality and entity preservation in translations.

03. RLfR achieves superior results under LLM-based judge evaluations.

Abstract

Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have shown strong gains but typically rely on large, carefully curated preference triplets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), which replaces static triplets with on-policy, actor-conditioned refinements produced by a frozen teacher. At each step, the actor samples candidate translations, the teacher performs a minimal local edit of each draft, and the actor is reinforced to close the gap using a composite reward that combines scaled negative edit distance for lexical and structural fidelity with COMET for semantic adequacy. This formulation yields a stable, model-aware learning signal without requiring explicit preference datasets. Experiments on FLORES-200 (English to German, Spanish,…
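The composite reward described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token-level Levenshtein distance, the length normalization, the weights `alpha` and `beta`, and the `semantic_score` callable (a hypothetical stand-in for a COMET scorer) are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Callable, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def composite_reward(
    draft: str,
    refinement: str,
    semantic_score: Callable[[str, str], float],
    alpha: float = 1.0,  # assumed weight on the edit-distance term
    beta: float = 1.0,   # assumed weight on the semantic term
) -> float:
    """Scaled negative edit distance plus a semantic adequacy score.

    `semantic_score` is a hypothetical placeholder for a COMET-style
    metric; the scaling by the longer sequence length is an assumption.
    """
    draft_toks = draft.split()
    ref_toks = refinement.split()
    norm = max(len(draft_toks), len(ref_toks), 1)
    edit_term = -edit_distance(draft_toks, ref_toks) / norm
    return alpha * edit_term + beta * semantic_score(draft, refinement)
```

With a constant stand-in scorer, `composite_reward("a b c d", "a b x d", lambda d, r: 0.5)` combines an edit term of -1/4 (one substitution over four tokens) with the semantic score of 0.5. A draft that already matches the teacher's refinement incurs no edit penalty, so its reward reduces to the semantic term alone.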
