Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

Lucas Bandarkar; Benjamin Muller; Pritish Yuvraj; Rui Hou; Nayan Singhal; Hongjiang Lv; Bing Liu

arXiv:2410.01335·cs.CL·May 27, 2025

TL;DR

This paper introduces a simple layer swapping technique to enhance cross-lingual mathematical reasoning in large language models without additional training, outperforming existing methods on multilingual math benchmarks.

Contribution

It proposes a novel model merging method that combines language and math expertise by swapping layers, enabling effective zero-shot cross-lingual transfer in LLMs.

Findings

01. Outperforms other merging methods on the MGSM benchmark by 10% across four languages

02. Layer swapping is simple, inexpensive, and based on interpretative analysis

03. Enables modular and post hoc transfer of reasoning capabilities

Abstract

Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and, without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate "experts" on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models…
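The swap described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes both experts expose a flat mapping from parameter names to values, and that per-layer parameters follow a `layers.<index>.` naming pattern (as in Hugging Face Llama checkpoints). The function name `layer_swap`, the counts `n_bottom` and `n_top`, and the toy values are hypothetical; the abstract does not say how many layers are swapped. Keeping non-layer parameters (for example, embeddings) from the math expert is also an assumption of this sketch.

```python
import re
from typing import Any, Dict

# Assumed naming pattern for per-layer parameters, e.g. "model.layers.3.mlp.weight".
LAYER_KEY = re.compile(r"\blayers\.(\d+)\.")


def layer_swap(
    math_expert: Dict[str, Any],
    language_expert: Dict[str, Any],
    num_layers: int,
    n_bottom: int,
    n_top: int,
) -> Dict[str, Any]:
    """Return a copy of the math expert whose first `n_bottom` and last
    `n_top` transformer layers are taken from the language expert."""
    if set(math_expert) != set(language_expert):
        raise ValueError("experts must share the same parameter names")
    if n_bottom < 0 or n_top < 0 or n_bottom + n_top > num_layers:
        raise ValueError("invalid number of layers to swap")

    swapped = set(range(n_bottom)) | set(range(num_layers - n_top, num_layers))

    merged: Dict[str, Any] = {}
    for name, value in math_expert.items():
        match = LAYER_KEY.search(name)
        if match and int(match.group(1)) in swapped:
            # Bottom or top layer: take the language expert's parameters.
            merged[name] = language_expert[name]
        else:
            # Middle layers and non-layer parameters stay with the math expert
            # (an assumption of this sketch).
            merged[name] = value
    return merged


# Toy example: 6 layers, with strings standing in for weight tensors.
names = ["model.embed_tokens.weight"] + [
    f"model.layers.{i}.mlp.weight" for i in range(6)
]
math_sd = {n: "math" for n in names}
lang_sd = {n: "lang" for n in names}

merged = layer_swap(math_sd, lang_sd, num_layers=6, n_bottom=2, n_top=1)
for n in names:
    print(n, "<-", merged[n])
```

With real checkpoints, the same function could be applied to two `state_dict()` mappings from models fine-tuned from the same pretrained model, since the merge only selects which expert each parameter comes from.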

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* Interesting idea and its evaluation on 1 model and 4 languages, with additional experiments
* Although the setup raises some questions (limited evaluation, why not freeze layers and avoid having to soup the transition layers, etc.), the expanded evaluation on Swahili and the limitations section address most of these
* Excellent writing, justification and presentation

Weaknesses

* limited evaluation: only 1 model and one set of tasks (math)

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The proposed methodology is highly practical in scenarios where one might have publicly available task-specific data in a high-resource language and generic instruction data in the low-resource language. The model parameter adjustments being fully post-hoc eliminate any additional computational overhead apart from the initial fine-tuning required to create task and language experts.
2. *layer swapping* with the best configuration consistently outperforms the individual SFT experts, the base

Weaknesses

1. The methodology is evaluated only on Llama 3.1, using the MGSM benchmark for 4 selected languages. In my opinion, evaluating the method on a single model, a single benchmark, and a limited set of languages makes the conclusion less generalizable. While the languages used in the study are diverse, incorporating more datasets and models (in terms of different architecture or pre-training) can strengthen the conclusion.
2. The assumption of availability of generic instruction data for low-resource language

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The paper introduces an efficient, innovative layer-swapping method for zero-shot cross-lingual transfer in LLMs, addressing the lack of task-specific data in low-resource languages with simplicity and strong empirical results.
- This technique is particularly notable for its straightforward implementation, allowing effective merging of task and language expertise without complex adjustments, making it a practical alternative to standard methods like model souping.
- Promising experimental

Weaknesses

- The method is tested only on math reasoning, leaving it unclear if layer swapping generalizes to other tasks. Additional evaluations on tasks like question-answering or translation would strengthen the claims of broad applicability.
- While the paper mentions different layer-swapping configurations, it lacks in-depth analysis of which configurations work best and why. A more detailed study of these choices would help to better understand the method and make it more robust. For example, provide abl

Videos

Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Focus