Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

Ning Lu; Shengcai Liu; Jiahao Wu; Weiyu Chen; Zhirui Zhang; Yew-Soon Ong; Qi Wang; Ke Tang

arXiv:2505.12038 · cs.LG · May 20, 2025

TL;DR

Safe Delta is a post-training defense for fine-tuned large language models. It adjusts the delta parameters (the parameter change introduced by fine-tuning) so that safety is preserved across diverse fine-tuning datasets without sacrificing utility.

Contribution

The paper introduces a safety-aware post-training adjustment technique that estimates the safety degradation caused by fine-tuning and adjusts the resulting parameter changes to preserve safety.

Findings

01. Consistently preserves safety across multiple datasets.

02. Maintains utility gains while limiting safety loss.

03. Effective across diverse fine-tuning scenarios.

Abstract

Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model's alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to…
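To make the term "delta parameters" concrete, here is a minimal sketch of the definition given in the abstract: the element-wise difference between fine-tuned and base weights. It uses NumPy arrays as stand-ins for model tensors. The function names `compute_delta` and `apply_delta`, the tensor name `layer.weight`, and the 0/1 `mask` are all hypothetical and chosen for illustration only. In particular, the mask is not the paper's selection rule, which is not described in the text above.

```python
import numpy as np


def compute_delta(base, finetuned):
    """Return delta parameters: fine-tuned weights minus base weights, per tensor."""
    return {name: finetuned[name] - base[name] for name in base}


def apply_delta(base, delta, mask=None):
    """Add a (optionally masked) delta back onto the base weights.

    `mask` is a hypothetical dict of 0/1 arrays marking which delta entries
    to keep. How Safe Delta actually chooses them is NOT modeled here.
    """
    adjusted = {}
    for name, weight in base.items():
        d = delta[name]
        if mask is not None:
            d = d * mask[name]  # zero out the delta entries that are not kept
        adjusted[name] = weight + d
    return adjusted


# Toy weights standing in for one layer of a model (assumed values).
base = {"layer.weight": np.array([[1.0, 2.0], [3.0, 4.0]])}
finetuned = {"layer.weight": np.array([[1.5, 2.0], [2.0, 4.25]])}

delta = compute_delta(base, finetuned)

# Keeping every delta entry recovers the fine-tuned model exactly.
keep_all = apply_delta(base, delta)

# Keeping only some entries gives a model between base and fine-tuned.
mask = {"layer.weight": np.array([[1.0, 0.0], [0.0, 1.0]])}
partial = apply_delta(base, delta, mask)
```

With no mask, `apply_delta` reproduces the fine-tuned weights. With an all-zero mask it would return the base weights. A post-training defense of this kind operates somewhere between those two extremes.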

Code & Models

Repositories

colinlu50/safedelta (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)