Multilingual Safety Alignment Via Sparse Weight Editing
Jiaming Liang, Zhaoxin Wang, Handing Wang

TL;DR
This paper introduces a training-free, sparse weight editing method to improve safety alignment of large language models across multiple languages, especially low-resource ones, with minimal impact on their general capabilities.
Contribution
It presents a closed-form solution for cross-lingual safety alignment that sparsely edits model weights, avoiding costly retraining and the need for scarce multilingual safety data.
Findings
Significantly reduces attack success rates in low-resource languages.
Maintains general reasoning abilities with negligible performance loss.
Effective across multiple languages and model architectures.
Abstract
Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint.…
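To make the idea of a null-space-constrained, closed-form weight edit concrete, here is a minimal NumPy sketch. It is not the paper's derivation, since the abstract above is truncated and gives no formula. It assumes a generic ridge-regularized least-squares edit of a single linear layer. All names (sparse_edit, X_lrl, Y_hrl, K_util, safety_rows, lam) are hypothetical. The assumed setup: W is the layer weight, X_lrl holds input activations for harmful LRL prompts, Y_hrl holds the outputs the layer should produce (the HRL safety behaviour), K_util holds input activations for general-utility prompts that the edit must leave unchanged, and safety_rows indexes the output neurons allowed to change.

```python
import numpy as np

def null_space_projector(K, tol=1e-8):
    """Projector onto the orthogonal complement of span(columns of K)."""
    U, s, _ = np.linalg.svd(K, full_matrices=False)
    U_r = U[:, s > tol * s.max()]          # basis of the utility subspace
    return np.eye(K.shape[0]) - U_r @ U_r.T

def sparse_edit(W, X_lrl, Y_hrl, K_util, safety_rows, lam=1e-3):
    """Hypothetical closed-form edit (an assumption, not the paper's formula).

    Solves min_D ||D P X - R||^2 + lam ||D||^2 with R = Y - W X, then
    applies Delta = D P only to the rows listed in safety_rows.
    Because P K = 0, the edited layer gives the same outputs on K_util.
    """
    P = null_space_projector(K_util)
    R = Y_hrl - W @ X_lrl                  # gap the edit should close
    Xp = P @ X_lrl                         # LRL inputs with utility directions removed
    A = Xp @ Xp.T + lam * np.eye(W.shape[1])
    delta = np.linalg.solve(A, Xp @ R.T).T @ P   # equals R Xp^T A^{-1} P
    mask = np.zeros((W.shape[0], 1))
    mask[safety_rows] = 1.0                # sparse: edit only the chosen neurons
    return W + mask * delta

# Toy usage with random data (illustrative shapes only).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W_new = sparse_edit(W, rng.normal(size=(16, 5)), rng.normal(size=(8, 5)),
                    rng.normal(size=(16, 6)), safety_rows=[1, 4, 6])
```

In this sketch the projection guarantees that outputs on the utility activations are unchanged, and the row mask keeps every weight outside the selected neurons fixed. How the paper actually selects safety neurons and defines its target subspace is not stated in the truncated abstract.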
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
