Cross-lingual Transfer of Reward Models in Multilingual Alignment
Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César Rodríguez, James Thorne

TL;DR
This paper investigates the cross-lingual transfer of reward models (RMs) trained in diverse languages, primarily English. English RMs exceed target-language RMs by 3-4% on average on Multilingual RewardBench, and this transfer carries over to improved multilingual instruction following.
Contribution
It provides empirical evidence of strong cross-lingual transfer of reward models and analyzes the underlying representation shifts, advancing multilingual RLHF methods.
Findings
English RMs outperform target-language RMs by 3-4% on average on Multilingual RewardBench.
Cross-lingual transfer in RMs propagates to enhanced multilingual instruction-following capability.
The code, model, and data are released, along with extensive analyses of off-the-shelf RMs.
Abstract
Reinforcement learning with human feedback (RLHF) has been shown to benefit substantially from precise reward models (RMs). However, recent studies of reward modeling schemes are skewed towards English, limiting the applicability of RLHF to multilingual alignment. In this work, we investigate the cross-lingual transfer of RMs trained in diverse languages, primarily from English. Our experimental results demonstrate strong cross-lingual transfer of English RMs, which exceed target-language RMs by 3-4% on average on Multilingual RewardBench. Furthermore, we analyze the cross-lingual transfer of RMs through representation shifts. Finally, we perform multilingual alignment to exemplify how cross-lingual transfer in RMs propagates to enhanced multilingual instruction-following capability, along with extensive analyses of off-the-shelf RMs. We release the code, model, and data.
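For readers unfamiliar with how a gap such as "3-4%" between two RMs is typically measured, the following is a minimal sketch. It assumes a RewardBench-style pairwise setup in which each prompt has a preferred ("chosen") and a dispreferred ("rejected") response, and an RM is counted as correct when it scores the chosen response higher. The function name `pairwise_accuracy` and all scores are hypothetical illustrations, not the paper's code or results.

```python
from typing import Sequence


def pairwise_accuracy(chosen_scores: Sequence[float],
                      rejected_scores: Sequence[float]) -> float:
    """Fraction of pairs where the RM scores the chosen response higher.

    Assumption: ties count as incorrect. This is an illustrative choice,
    not something stated in the abstract.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("at least one pair is required")
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


# Hypothetical reward scores on the same four target-language pairs.
english_rm_acc = pairwise_accuracy([1.2, 0.4, 2.0, -0.1], [0.3, 0.9, 1.1, -0.5])
target_rm_acc = pairwise_accuracy([0.8, 0.1, 0.5, -0.2], [0.3, 0.6, 0.9, -0.5])

# Accuracy gap in percentage points (made-up data, for illustration only).
gap_points = 100 * (english_rm_acc - target_rm_acc)
print(f"English RM: {english_rm_acc:.2f}, target RM: {target_rm_acc:.2f}, "
      f"gap: {gap_points:.1f} points")
```

Under this reading, the reported improvement would correspond to the difference between two such accuracies averaged over languages; the paper itself should be consulted for the exact aggregation.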
Taxonomy
Topics: Employee Welfare and Language Studies
