Language Imbalance Driven Rewarding for Multilingual Self-improving

Wen Yang; Junhong Wu; Chen Wang; Chengqing Zong; Jiajun Zhang

arXiv:2410.08964·cs.CL·February 27, 2025

TL;DR

This paper introduces a novel method leveraging language imbalance as a reward signal to iteratively improve multilingual capabilities of large language models, significantly enhancing performance across multiple languages.

Contribution

It proposes Language Imbalance Driven Rewarding, a new approach that uses inherent language imbalances as a reward to self-improve LLMs in non-dominant languages.
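As a rough illustration of how a language gap could be turned into preference data, the sketch below pairs a response generated directly in a non-dominant language (treated as rejected) with the dominant-language response translated into that language (treated as chosen). This pairing is an assumption; the page does not spell out the exact recipe. The names `PreferencePair`, `build_pairs`, `generate`, and `translate` are hypothetical placeholders rather than functions from the official repository, and the stubs at the bottom stand in for a real model and translator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str    # prompt in the non-dominant language
    chosen: str    # dominant-language response translated into the non-dominant language
    rejected: str  # response generated directly in the non-dominant language


def build_pairs(
    prompts_dl: List[str],
    generate: Callable[[str], str],             # hypothetical: model response to a prompt
    translate: Callable[[str, str, str], str],  # hypothetical: (text, src, dst) -> text
    dl: str = "en",
    nl: str = "es",
) -> List[PreferencePair]:
    """Build (chosen, rejected) pairs under the assumed language-imbalance ranking."""
    pairs = []
    for prompt_dl in prompts_dl:
        prompt_nl = translate(prompt_dl, dl, nl)
        response_dl = generate(prompt_dl)  # assumed to be the stronger response
        response_nl = generate(prompt_nl)  # assumed to be the weaker response
        chosen = translate(response_dl, dl, nl)
        pairs.append(PreferencePair(prompt_nl, chosen, response_nl))
    return pairs


# Stub model and translator, for illustration only.
def fake_generate(prompt: str) -> str:
    return f"answer({prompt})"


def fake_translate(text: str, src: str, dst: str) -> str:
    return f"[{src}->{dst}] {text}"


pairs = build_pairs(["What is 2+2?"], fake_generate, fake_translate)
```

In an iterative setup, the pairs from one round would be used for preference training, and the updated model would then generate the next round's responses.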

Findings

1. Improved multilingual performance on instruction-following tasks.
2. Enhanced arithmetic reasoning accuracy across languages.
3. 7.46% average win rate increase on X-AlpacaEval.

Abstract

Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. Thus, we propose Language Imbalance Driven Rewarding, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct…
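For readers unfamiliar with DPO, the following minimal sketch computes the standard per-pair DPO loss from sequence log-probabilities. It illustrates the general objective only, not the authors' training code; the function name `dpo_loss` and the default `beta=0.1` are assumptions chosen for the example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# With identical policy and reference, the loss is log(2).
baseline = dpo_loss(-1.5, -1.5, -1.5, -1.5)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's.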

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 3·Confidence 4

Strengths

The method is effective in improving the performance of general instruction following and mathematical reasoning. The paper also presents analysis for the effectiveness and accuracy of the preference pairs, which serves as a nice support for the method.

Weaknesses

The paper is not the first attempt to improve the multilingual ability of LLMs through cross-lingual optimization. One of the cited papers, She et al. (ACL 2024), conducts experiments that optimize preferences with DPO and an off-the-shelf translator. I find the proposed method quite similar to that one, but the paper gives no clear indication of how the two are related.

Reviewer 02·Rating 6·Confidence 4

Strengths

- I think the idea of exploiting the gap in an LLM's inherent language capability for self-improvement is interesting and intuitive.
- The paper has extensive experiments across open-ended and close-ended benchmarks. Results are consistent and in favour of the proposed method. It is clear that through iterative data synthesis and training, models can progressively improve in both dominant languages (`dl`) and non-dominant languages (`nl`).

Weaknesses

1. Several factors in asserting the assumptions could not be carefully controlled. Table 1 line 180: GPT4-as-a-judge is used to confirm the quality of responses in different languages, however, there is no guarantee that the scores for different languages are on the same scale and that GPT4 is able to judge those languages using the same "standard". This could apply to Table 2 and Table 3 too.
2. I find the motivation of creating (`nl`, `dl->nl`) preference data reasonable. However, I did not f

Reviewer 03·Rating 6·Confidence 4

Strengths

1. This work suggests a promising direction for self-improving multilingual LLMs by leveraging intrinsic language imbalance.
2. Experimental results demonstrate significant improvements over baseline models on multilingual benchmarks.
3. The method is clearly explained and easy to understand.

Weaknesses

1. The evaluation relies heavily on GPT-4 to assess response quality, but its reliability as a multilingual judge is uncertain, as discussed in [1]. The evaluation in Table 1 across various languages may be biased because it is unclear whether GPT-4 rates identical responses equally in different languages. A fairer method would be to evaluate responses in the same language, including both directly generated and translated ones, so that scores are more comparable.
2. The method depends on LL

Code & Models

Repositories

znlp/language-imbalance-driven-rewarding
Official

Taxonomy

Topics·Educational and Psychological Assessments

Methods·Direct Preference Optimization