Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang; Zhaoxin Wang; Handing Wang

arXiv:2602.22554·cs.LG·February 27, 2026

TL;DR

This paper introduces a training-free, sparse weight editing method to improve safety alignment of large language models across multiple languages, especially low-resource ones, with minimal impact on their general capabilities.

Contribution

It presents a closed-form solution for cross-lingual safety alignment that sparsely edits model weights, avoiding both costly retraining and the need for scarce multilingual safety data.

Findings

01. Significantly reduces attack success rates in low-resource languages.

02. Maintains general reasoning abilities with negligible performance loss.

03. Effective across multiple languages and model architectures.

Abstract

Large Language Models (LLMs) exhibit significant safety disparities across languages, with prompts in low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and depend on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint.…
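
The abstract is truncated before the paper's actual formula, so the following is only a generic sketch of what a null-space-constrained, closed-form weight edit can look like. It is not the authors' derivation. Everything here is an assumption made for illustration: the function names, the ridge-regularized least-squares objective, the use of a projector built from "general utility" activations `K_general`, and the choice to model sparsity by editing only the rows listed in `neuron_idx`.

```python
import numpy as np

def null_space_projector(K, tol=1e-10):
    """Return P with P @ K = 0 (projector onto the complement of span(K)).

    K: (d_in, m) activations whose outputs should stay unchanged (assumed).
    """
    U, s, _ = np.linalg.svd(K, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))
    U_r = U[:, :rank]
    return np.eye(K.shape[0]) - U_r @ U_r.T

def sparse_null_space_edit(W, X_lrl, Y_hrl, K_general, neuron_idx, ridge=1e-6):
    """Hypothetical closed-form edit of a linear layer W (d_out, d_in).

    X_lrl:     (d_in, n)  layer inputs for harmful low-resource-language prompts
    Y_hrl:     (d_out, n) target outputs taken from the high-resource language
    K_general: (d_in, m)  layer inputs for general (benign) data to preserve
    neuron_idx: output rows ("safety neurons") that are allowed to change
    """
    P = null_space_projector(K_general)
    PX = P @ X_lrl                       # inputs restricted to the null space
    R = Y_hrl - W @ X_lrl                # residual the edit should remove
    # Ridge least squares for Delta in: min ||Delta @ PX - R||^2 + ridge*||Delta||^2
    G = PX @ PX.T + ridge * np.eye(PX.shape[0])
    delta = np.linalg.solve(G, PX @ R.T).T @ P   # Delta @ P, so delta @ K = 0
    # Sparsity: only the selected rows of W are modified.
    W_new = W.copy()
    W_new[neuron_idx] += delta[neuron_idx]
    return W_new
```

Because each output row of a linear least-squares problem is solved independently, restricting the update to the rows in `neuron_idx` leaves those rows' fit unchanged while keeping every other row of `W` exactly as it was. The projection step guarantees that the edited layer produces the same outputs as the original on any input in the span of `K_general`.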

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)