Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching

Weixiang Zhao; Yulin Hu; Zhuojun Li; Yang Deng; Jiahe Guo; Xingyu Sui; Yanyan Zhao; Bing Qin; Tat-Seng Chua; Ting Liu

arXiv:2405.13820·cs.CL·December 18, 2024·1 cites

Open Access

TL;DR

This paper introduces SafePatching, a novel post safety alignment method for large language models that enhances safety, reduces over-safety, and preserves utility without retraining from scratch.

Contribution

The paper presents SafePatching, a comprehensive framework for post safety alignment that improves the balance between safety and utility in large language models.

Findings

01

SafePatching outperforms baseline methods in safety and utility metrics.

02

It achieves more comprehensive safety improvements across multiple LLMs.

03

It demonstrates effectiveness in continual safety alignment scenarios.

Abstract

Safety alignment of large language models (LLMs) has been gaining increasing attention. However, current safety-aligned LLMs suffer from fragile and imbalanced safety mechanisms: they can still be induced to generate unsafe responses, exhibit over-safety by rejecting safe user inputs, and fail to preserve general utility after safety alignment. To this end, we propose a novel post safety alignment (PSA) method to address these inherent and emerging safety challenges, including safety enhancement, over-safety mitigation, and utility preservation. Specifically, we introduce SafePatching, a novel framework for comprehensive PSA, where two distinct safety patches are developed on the harmful data to enhance safety and mitigate over-safety concerns, and then seamlessly integrated into the target LLM backbone without compromising its utility. Extensive experiments on four…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling