Language Imbalance Driven Rewarding for Multilingual Self-improving
Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang

TL;DR
This paper introduces a novel method leveraging language imbalance as a reward signal to iteratively improve multilingual capabilities of large language models, significantly enhancing performance across multiple languages.
Contribution
It proposes Language Imbalance Driven Rewarding, a new approach that uses inherent language imbalances as a reward to self-improve LLMs in non-dominant languages.
Findings
Improved multilingual performance on instruction-following tasks.
Enhanced arithmetic reasoning accuracy across languages.
7.46% average win rate increase on X-AlpacaEval.
Abstract
Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. Thus, we propose Language Imbalance Driven Rewarding, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct…
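The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that, for a prompt in a non-dominant language (`nl`), the dominant-language (`dl`) response translated into `nl` is treated as preferred over the response generated directly in `nl`, which matches the (`nl`, `dl->nl`) pairs mentioned in the reviews. The names `PreferencePair`, `build_pairs`, `generate`, and `translate` are hypothetical placeholders, the default language codes are arbitrary, and whether translation is done by the model itself or by an external system is left open. The `dpo_loss` function is the standard per-pair DPO objective, with `beta=0.1` chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str    # prompt in the non-dominant language (nl)
    chosen: str    # dl response translated into nl (assumed to be preferred)
    rejected: str  # response generated directly in nl


def build_pairs(
    prompts_dl: List[str],
    generate: Callable[[str], str],             # hypothetical: prompt -> model response
    translate: Callable[[str, str, str], str],  # hypothetical: (text, src, dst) -> text
    dl: str = "en",
    nl: str = "es",
) -> List[PreferencePair]:
    """Turn dominant-language prompts into (nl, dl->nl) preference pairs."""
    pairs = []
    for prompt_dl in prompts_dl:
        prompt_nl = translate(prompt_dl, dl, nl)
        response_dl = generate(prompt_dl)  # answer in the dominant language
        response_nl = generate(prompt_nl)  # answer in the non-dominant language
        chosen = translate(response_dl, dl, nl)
        pairs.append(PreferencePair(prompt_nl, chosen, response_nl))
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In an iterative setup, the model fine-tuned on one round of pairs would serve as `generate` (and possibly `translate`) in the next round. That loop is omitted here because the excerpt does not specify its details.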
Peer Reviews
Decision: ICLR 2025 Poster
The method is effective in improving the performance of general instruction following and mathematical reasoning. The paper also presents analysis for the effectiveness and accuracy of the preference pairs, which serves as a nice support for the method.
The paper is not the first attempt to improve the multilingual ability of LLMs through cross-lingual optimization. One of the cited papers, She et al. (ACL 2024), conducts experiments that optimize preferences with DPO and an off-the-shelf translator. I find the proposed method quite similar to that work, but the paper gives no clear indication of how the two relate.
- I think the idea of exploiting the gap in an LLM's inherent language capability for self-improvement is interesting and intuitive.
- The paper has extensive experiments across open-ended and close-ended benchmarks. Results are consistent and in favour of the proposed method. It is clear that through iterative data synthesis and training, models can progressively improve in both dominant languages (`dl`) and non-dominant languages (`nl`).
1. Several factors in asserting the assumptions could not be carefully controlled. Table 1 line 180: GPT4-as-a-judge is used to confirm the quality of responses in different languages, however, there is no guarantee that the scores for different languages are on the same scale and that GPT4 is able to judge those languages using the same "standard". This could apply to Table 2 and Table 3 too.
2. I find the motivation of creating (`nl`, `dl->nl`) preference data reasonable. However, I did not f
1. This work suggests a promising direction for self-improving multilingual LLMs by leveraging intrinsic language imbalance.
2. Experimental results demonstrate significant improvements over baseline models on multilingual benchmarks.
3. The method is clearly explained and easy to understand.
1. The evaluation relies heavily on GPT-4 to assess response quality, but its reliability as a multilingual judge is uncertain, as discussed in [1]. The evaluation in Table 1 across various languages may be biased because it is unclear whether GPT-4 rates identical responses equally in different languages. A fairer method would be to evaluate responses in the same language, including both directly generated and translated ones, so that scores are more comparable.
2. The method depends on LL
Taxonomy
Topics: Educational and Psychological Assessments
Methods: Direct Preference Optimization
