TL;DR
This paper introduces a multilingual unification learning approach that enhances large language models' reasoning efficiency and reduces data and inference requirements by leveraging diverse multilingual data.
Contribution
The paper proposes the novel L^2 multilingual unification learning method, improving reasoning performance and efficiency with minimal data, and demonstrating its orthogonality to other data-efficient techniques.
Findings
Multilingual learning reduces data and token requirements.
Small amounts of multilingual data significantly improve reasoning.
The L^2 method is orthogonal to other data-efficient approaches.
Abstract
This paper explores the challenges of test-time scaling of large language models (LLMs) with respect to both data and inference efficiency. We highlight the diversity of multilingual reasoning based on our pilot studies, and then introduce a novel approach, \(L^2\) multilingual unification learning, with a decoding intervention strategy for further investigation. The basic idea of \(L^2\) is that the reasoning process varies across languages, and that these variations may be mutually beneficial for both model performance and efficiency. Specifically, there are two types of multilingual data: entire long chain-of-thought annotations in different languages, and step-wise mixtures of languages. By further tuning on them, we show that even small amounts of data can significantly improve reasoning capabilities. Our findings suggest that multilingual learning reduces both the…
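The two data types named in the abstract can be illustrated with a minimal sketch. It assumes each problem comes with chain-of-thought steps that are already aligned across languages; the function names, the dictionary layout, and the uniform random choice of language per step are all hypothetical and are not taken from the paper.

```python
import random


def full_chain(steps_by_lang, lang):
    """Type 1: the entire chain of thought in a single language."""
    return "\n".join(steps_by_lang[lang])


def stepwise_mixture(steps_by_lang, rng):
    """Type 2: one language chosen per step (assumes steps are aligned across languages)."""
    langs = sorted(steps_by_lang)
    n_steps = len(steps_by_lang[langs[0]])
    if any(len(steps_by_lang[lang]) != n_steps for lang in langs):
        raise ValueError("steps must be aligned across languages")
    # Hypothetical policy: pick the language of each step uniformly at random.
    chosen = [rng.choice(langs) for _ in range(n_steps)]
    text = "\n".join(steps_by_lang[lang][i] for i, lang in enumerate(chosen))
    return text, chosen


# Toy annotations, invented for illustration only.
example = {
    "en": ["Let x be the unknown.", "Then 2x = 10.", "So x = 5."],
    "zh": ["设 x 为未知数。", "则 2x = 10。", "所以 x = 5。"],
}
english_only = full_chain(example, "en")
mixed, chosen = stepwise_mixture(example, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the mixture reproducible, which matters when the tuning set is as small as the abstract suggests.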
Peer Reviews
Decision·Submitted to ICLR 2026
1. The idea of leveraging multilingual reasoning diversity (rather than simply more data) to increase reasoning efficiency is novel and sound. 2. The authors show strong empirical gains even with extremely small annotated sample sizes when augmented via multilingual CoT. 3. The paper addresses both data efficiency and inference efficiency, which are increasingly important in practice for LLM deployment.
1. The experiments use relatively small and controlled benchmarks, e.g., AIME24 with only 30 problems. It is not clear how well the approach would scale to broader, more diverse reasoning tasks. 2. While a reduction in inference tokens is claimed, detailed breakdowns of token savings versus accuracy trade-offs (e.g., across languages and varying lengths) are not sufficiently discussed.
The novel data augmentation method.
The authors compare their method to some unnamed but presumably simplistic technique for data augmentation. I think it would be better to compare to several existing techniques, e.g., starting with classic backtranslation. I suspect that the results could be explained by data augmentation itself rather than by language diversity.
1. Multilingual thinking as a resource for efficient reasoning is compelling. Using different languages to make reasoning more efficient is a novel direction to explore. The idea that different languages induce distinct reasoning compression patterns is interesting.
1. Some previous works have already discussed using multilingual data to boost reasoning performance [1]; the authors should consider comparing their method with these works and explaining the differences in more detail. 2. Token count is the only metric used to measure efficiency; other metrics such as inference time, FLOP cost, and memory usage should also be considered. 3. When using tokens in other languages, the compression rate of the tokenizer in each language should also be considered,
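The tokenizer concern in the last point can be made concrete with a small sketch. It assumes a parallel text sample per language and uses a toy byte-level tokenizer as a stand-in; `byte_tokenize`, `relative_token_cost`, and `normalized_tokens` are hypothetical names, and a real evaluation would substitute the tokenizer of the model under study.

```python
def byte_tokenize(text):
    """Toy stand-in tokenizer: one token per UTF-8 byte (an assumption, not a real LLM tokenizer)."""
    return list(text.encode("utf-8"))


def relative_token_cost(parallel, tokenize, ref_lang="en"):
    """Tokens needed to express the same content in each language, relative to ref_lang."""
    ref = len(tokenize(parallel[ref_lang]))
    return {lang: len(tokenize(text)) / ref for lang, text in parallel.items()}


def normalized_tokens(raw_tokens, lang, cost):
    """Raw token count divided by the language's relative cost, so lengths are comparable."""
    return raw_tokens / cost[lang]


# Toy parallel sample, invented for illustration only.
parallel = {"en": "So x = 5.", "zh": "所以 x = 5。"}
cost = relative_token_cost(parallel, byte_tokenize)
zh_equivalent = normalized_tokens(150, "zh", cost)
```

Under this normalization, a shorter raw token count in one language only counts as a real saving if it remains shorter after dividing by that language's relative cost.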
