Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching
Weixiang Zhao, Yulin Hu, Zhuojun Li, Yang Deng, Jiahe Guo, Xingyu Sui, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu

TL;DR
This paper introduces SafePatching, a novel post safety alignment method for large language models that enhances safety, reduces over-safety, and preserves utility without retraining from scratch.
Contribution
The paper presents SafePatching, a comprehensive framework for post safety alignment that improves the balance between safety and utility in large language models.
Findings
SafePatching outperforms baseline methods in safety and utility metrics.
It achieves more comprehensive safety improvements across multiple LLMs.
It also demonstrates effectiveness in continual safety alignment scenarios.
Abstract
Safety alignment of large language models (LLMs) has been gaining increasing attention. However, current safety-aligned LLMs suffer from fragile and imbalanced safety mechanisms: they can still be induced to generate unsafe responses, exhibit over-safety by rejecting safe user inputs, and fail to preserve general utility after safety alignment. To this end, we propose a novel post safety alignment (PSA) method to address these inherent and emerging safety challenges, including safety enhancement, over-safety mitigation, and utility preservation. Specifically, we introduce SafePatching, a novel framework for comprehensive PSA, where two distinct safety patches are developed on the harmful data to enhance safety and mitigate over-safety concerns, and then seamlessly integrated into the target LLM backbone without compromising its utility. Extensive experiments on four…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
