A safety realignment framework via subspace-oriented model fusion for large language models

Xin Yi; Shunfan Zheng; Linlin Wang; Xiaoling Wang; Liang He

arXiv:2405.09055·cs.CL·May 16, 2024

TL;DR

This paper proposes a subspace-oriented model fusion framework to enhance the safety of large language models, effectively balancing safety and task performance during model realignment.

Contribution

It introduces a novel safety realignment method using subspace masking to fuse safety features with task-specific models, reducing catastrophic forgetting.

Findings

01. Preserves safety during model fusion without performance loss

02. Effective across multiple languages and tasks

03. Maintains safety and task capabilities simultaneously

Abstract

The current safeguard mechanisms for large language models (LLMs) are susceptible to jailbreak attacks, making them inherently fragile. Even fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning after downstream fine-tuning. However, safety fine-tuning carries a risk of catastrophic forgetting: LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related…
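The idea of a task vector and a masked fusion can be illustrated with a small sketch. This is not the SOMF implementation: the abstract is truncated before it explains how the safety-related regions are identified, so the binary `mask` below is simply given as an input, and the names `task_vector`, `fuse_with_mask`, `base`, `aligned`, and `finetuned` are hypothetical. It assumes a task vector is the element-wise difference between a tuned model's weights and a shared starting checkpoint, and that the mask selects, per weight, whether the safety update or the downstream update is kept.

```python
from typing import Dict, List

# Toy stand-in for a model: parameter name -> flat list of weights (assumption).
Weights = Dict[str, List[float]]


def task_vector(tuned: Weights, base: Weights) -> Weights:
    """Element-wise difference (tuned - base) for every named parameter."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in base
    }


def fuse_with_mask(
    base: Weights,
    safety_vec: Weights,
    task_vec: Weights,
    mask: Dict[str, List[int]],
) -> Weights:
    """Add the safety update where mask == 1 and the task update elsewhere.

    The mask is taken as given here; how it is obtained is outside this sketch.
    """
    fused: Weights = {}
    for name in base:
        fused[name] = [
            b + (s if m else t)
            for b, s, t, m in zip(
                base[name], safety_vec[name], task_vec[name], mask[name]
            )
        ]
    return fused


# Hypothetical three-weight "models" sharing one starting checkpoint.
base = {"w": [1.0, 2.0, 3.0]}
aligned = {"w": [1.5, 2.0, 2.0]}      # after safety alignment
finetuned = {"w": [1.0, 4.0, 3.5]}    # after downstream fine-tuning
mask = {"w": [1, 0, 0]}               # 1 = keep the safety update at this position

safety_vec = task_vector(aligned, base)
task_vec = task_vector(finetuned, base)
realigned = fuse_with_mask(base, safety_vec, task_vec, mask)
print(realigned)
```

In this toy case only the first weight takes the safety update, while the other two keep the downstream update, which is the trade-off between safety and task knowledge that the abstract describes.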

Code & Models

Repositories

xinykou/safety_realignment (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Occupational Health and Safety Research