RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
Dongyub Jude Lee, Zhenyi Ye, Pengcheng He

TL;DR
This paper introduces RLfR, a reinforcement learning approach for machine translation that refines translations through on-policy teacher-guided edits, improving semantic quality without relying on preference triplets.
Contribution
RLfR replaces static preference triplets with on-policy teacher-guided refinements, enabling stable, model-aware learning for MT without explicit preference datasets.
Findings
RLfR outperforms baseline methods on FLORES-200 when translating from English into multiple target languages.
RLfR improves semantic quality and entity preservation in translations.
RLfR achieves superior results under LLM-based judge evaluations.
Abstract
Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have shown strong gains but typically rely on large, carefully curated preference triplets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), which replaces static triplets with on-policy, actor-conditioned refinements produced by a frozen teacher. At each step, the actor samples candidate translations, the teacher performs a minimal local edit of each draft, and the actor is reinforced to close the gap using a composite reward that combines scaled negative edit distance for lexical and structural fidelity with COMET for semantic adequacy. This formulation yields a stable, model-aware learning signal without requiring explicit preference datasets. Experiments on FLORES-200 (English to German, Spanish,…
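The composite reward described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scaling, weighting, or tokenization, so everything specific here is an assumption: word-level whitespace tokens, edit distance normalized by the longer sequence, and the hypothetical weights `alpha` and `beta`. The `semantic_score` argument stands in for a COMET score computed elsewhere; no COMET API is called.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def composite_reward(draft: str, refinement: str, semantic_score: float,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical composite reward: scaled negative edit distance between
    the actor's draft and the teacher's refinement, plus a semantic score.

    alpha and beta are illustrative weights, not values from the paper.
    """
    draft_toks = draft.split()
    ref_toks = refinement.split()
    dist = edit_distance(draft_toks, ref_toks)
    # Assumed scaling: normalize by the longer sequence so the term is in [0, 1].
    scaled = dist / max(len(draft_toks), len(ref_toks), 1)
    return -alpha * scaled + beta * semantic_score
```

Under these assumptions, a draft the teacher leaves unchanged incurs no edit penalty and its reward reduces to the weighted semantic score, while each token the teacher edits lowers the reward proportionally.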
