A safety realignment framework via subspace-oriented model fusion for large language models
Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He

TL;DR
This paper proposes a subspace-oriented model fusion framework to enhance the safety of large language models, effectively balancing safety and task performance during model realignment.
Contribution
It introduces a novel safety realignment method using subspace masking to fuse safety features with task-specific models, reducing catastrophic forgetting.
Findings
Preserves safety during model fusion without performance loss
Effective across multiple languages and tasks
Maintains safety and task capabilities simultaneously
Abstract
The current safeguard mechanisms for large language models (LLMs) are susceptible to jailbreak attacks, making them inherently fragile. Even the process of fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there is a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related…
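The first step the abstract describes, separating task vectors from fine-tuned weights, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a task vector is the element-wise difference between a model's weights and a shared base checkpoint, and it takes a fixed binary mask as given, whereas the abstract is truncated before it explains how safety-related regions are identified. The names `task_vector`, `masked_fusion`, `base`, and `mask` are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(model: Weights, base: Weights) -> Weights:
    """Element-wise difference between a model and the shared base checkpoint."""
    return {
        name: [m - b for m, b in zip(model[name], base[name])]
        for name in base
    }


def masked_fusion(base: Weights, safety_tv: Weights,
                  task_tv: Weights, mask: Weights) -> Weights:
    """Assumed fusion rule: where mask is 1, apply the safety vector;
    where mask is 0, apply the task vector. The mask is supplied by the
    caller here; how it is obtained is outside this sketch."""
    fused: Weights = {}
    for name in base:
        fused[name] = [
            b + m * s + (1 - m) * t
            for b, s, t, m in zip(base[name], safety_tv[name],
                                  task_tv[name], mask[name])
        ]
    return fused


# Hypothetical toy weights with a single parameter tensor "w".
base = {"w": [0.0, 0.0, 0.0]}
aligned = {"w": [1.0, 2.0, 3.0]}      # stands in for the initially aligned model
finetuned = {"w": [0.5, -1.0, 4.0]}   # stands in for a downstream fine-tuned model
mask = {"w": [1.0, 0.0, 1.0]}         # assumed binary safety mask

safety_tv = task_vector(aligned, base)
task_tv = task_vector(finetuned, base)
realigned = masked_fusion(base, safety_tv, task_tv, mask)
print(realigned)
```

In this toy case, positions 0 and 2 take the aligned model's update and position 1 keeps the fine-tuned update, which shows how a mask can keep safety-related weights and task-specific weights apart in one fused model.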
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Occupational Health and Safety Research
