TL;DR
SSR-Zero introduces a reference-free, self-rewarding reinforcement learning framework for machine translation, achieving state-of-the-art results with fully online training on monolingual data.
Contribution
It presents the first fully self-rewarding, reference-free RL method for MT that surpasses existing models and can be combined with external supervision for further improvements.
Findings
SSR-Zero outperforms existing MT-specific LLMs and larger general LLMs in English-Chinese translation.
Augmenting SSR with external supervision from COMET yields state-of-the-art performance.
The self-rewarding mechanism is more effective than an external LLM-as-a-judge in MT.
Abstract
Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English ↔ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by…
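The reference-free, self-judging idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `self_rewarding_step`, `generate`, and `judge` are hypothetical, the two callables stand in for a translation prompt and a judging prompt sent to the same model, and the mean-reward baseline is an assumption of this sketch rather than a detail taken from the paper.

```python
from statistics import mean
from typing import Callable, List, Tuple


def self_rewarding_step(
    source: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    num_samples: int = 4,
) -> List[Tuple[str, float, float]]:
    """Sample translations of `source` and score them without any reference.

    `generate` and `judge` are hypothetical stand-ins for two prompts to the
    same model: one that translates, one that rates a (source, candidate) pair.
    Returns (candidate, reward, advantage) triples. The advantage is the reward
    minus the group mean, which is an assumed baseline for illustration only.
    """
    # Online sampling: candidates come from the current model, not a fixed dataset.
    candidates = [generate(source) for _ in range(num_samples)]
    # Self-judging: the reward depends only on the source and the candidate.
    rewards = [judge(source, candidate) for candidate in candidates]
    baseline = mean(rewards)
    return [(c, r, r - baseline) for c, r in zip(candidates, rewards)]
```

In a real training loop, the advantages would weight a policy-gradient update of the model that produced the candidates. That update step is omitted here because the text above does not specify it.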
