Layer-wise Swapping for Generalizable Multilingual Safety

Hyunseo Shin; Wonseok Hwang

arXiv:2601.22620 · cs.CL · February 16, 2026

TL;DR

This paper introduces a layer-swapping technique that transfers safety alignment from an English safety expert to low-resource language experts in LLMs without additional training, improving safety while preserving general language understanding.

Contribution

It proposes a novel safety-aware layer swapping method that adaptively transfers safety alignment across languages without additional training.

Findings

01 Achieves safety improvements on multilingual benchmarks

02 Maintains performance on general language understanding tasks

03 Produces more aligned and less harmful responses

Abstract

Despite the rapid advancement of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, fine-tuned on their respective instruction datasets, tend to produce unsafe responses at higher rates than their high-resource counterparts. In this work, we propose a safety-aware layer-swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the…
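The select-or-blend idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' procedure: the abstract does not say how "degree of specialization" is measured, so the sketch uses each expert's L2 distance from a shared base model, and the function name `swap_layers`, the helper `l2_diff`, and the threshold `select_ratio` are hypothetical. Weights are plain Python lists so the example runs without any framework.

```python
import math


def l2_diff(a, b):
    """L2 distance between two flat weight vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def swap_layers(base, safety_expert, language_expert, select_ratio=2.0):
    """Merge two fine-tuned experts module by module, with no training.

    Each argument maps a module name to a flat list of weights. All three
    models are assumed to share the same architecture and module names.

    ASSUMPTION: a module's "degree of specialization" is approximated by how
    far fine-tuning moved it from the shared base model. The source does not
    specify the actual criterion.
    """
    merged = {}
    for name, base_w in base.items():
        safety_w = safety_expert[name]
        lang_w = language_expert[name]
        safety_shift = l2_diff(safety_w, base_w)
        lang_shift = l2_diff(lang_w, base_w)

        if safety_shift >= select_ratio * lang_shift:
            # Module changed mostly during safety tuning: take it as-is.
            merged[name] = list(safety_w)
        elif lang_shift >= select_ratio * safety_shift:
            # Module changed mostly during language tuning: keep it.
            merged[name] = list(lang_w)
        else:
            # Both experts changed the module comparably: blend them,
            # weighting each expert by its share of the total shift.
            alpha = safety_shift / (safety_shift + lang_shift)
            merged[name] = [
                alpha * s + (1.0 - alpha) * l
                for s, l in zip(safety_w, lang_w)
            ]
    return merged
```

With real models, the same loop would run over a framework's parameter dictionaries instead of lists. The hypothetical `select_ratio` controls how lopsided the two shifts must be before a module is swapped wholesale rather than blended.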

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning