Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu

TL;DR
This paper introduces a simple layer swapping technique to enhance cross-lingual mathematical reasoning in large language models without additional training, outperforming existing methods on multilingual math benchmarks.
Contribution
It proposes a novel model merging method that combines language and math expertise by swapping layers, enabling effective zero-shot cross-lingual transfer in LLMs.
Findings
Outperforms other merging methods on MGSM benchmark by 10% across four languages
Layer swapping is simple, inexpensive, and based on interpretative analysis
Enables modular and post hoc transfer of reasoning capabilities
Abstract
Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate "experts" on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models…
Peer Reviews
Decision·ICLR 2025 Spotlight
* Interesting idea and its evaluation on 1 model and 4 languages, with additional experiments * Although the setup raises some questions (limited evaluation, why not freeze layers and avoid having to soup the transition layers, etc.), the expanded evaluation on Swahili and the limitations section address most of these * excellent writing, justification and presentation
* limited evaluation: only 1 model and one set of tasks (math)
1. The proposed methodology is highly practical in scenarios where one might have publicly available task-specific data in a high-resource language and generic instruction data in the low-resource language. The model parameter adjustments being fully post-hoc eliminate any additional computational overhead apart from the initial fine-tuning required to create task and language experts. 2. *layer swapping* with the best configuration consistently outperforms the individual SFT experts, the base
1. The methodology is evaluated only on Llama 3.1, using MGSM benchmark for 4 selective languages. In my opinion, evaluation of the method on a single model, single benchmark and limited languages makes the conclusion less generalizable. While the languages used in the study are diverse, incorporating more datasets and models (in terms of different architecture or pre-training) can strengthen the conclusion. 2. The assumption of availability of generic instruction data for low-resource language
- The paper introduces an efficient, innovative layer-swapping method for zero-shot cross-lingual transfer in LLMs, addressing the lack of task-specific data in low-resource languages with simplicity and strong empirical results. - This technique is particularly notable for its straightforward implementation, allowing effective merging of task and language expertise without complex adjustments, making it a practical alternative to standard methods like model souping. - Promising experimental
- The method is tested only on math reasoning, leaving it unclear if layer swapping generalizes to other tasks. Additional evaluations on tasks like question-answering or translation would strengthen the claims of broad applicability. - While the paper mentions different layer-swapping configurations, it lacks in-depth analysis on which configurations work best and why. A more detailed study of these choices would help to better understand the method make it more robust. For example, provide abl
Videos
Taxonomy
TopicsTopic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
MethodsFocus
