Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

Di Wu; Xin Lu; Yanyan Zhao; Bing Qin

arXiv:2412.11041 · cs.CL · May 27, 2025

TL;DR

This paper introduces IRR, a post-hoc safety realignment method for fine-tuned LLMs that removes unsafe delta parameters to improve safety while maintaining downstream task performance.

Contribution

The paper presents IRR (Identify, Remove, and Recalibrate for Safety Realignment), a technique that re-aligns fine-tuned LLMs by identifying and removing unsafe delta parameters and recalibrating the retained ones.

Findings

01

IRR significantly improves the safety of fine-tuned models on safety benchmarks, including harmful queries and jailbreak attacks.

02

IRR maintains downstream task performance.

03

IRR is effective under both full fine-tuning and LoRA.

Abstract

Although large language models (LLMs) achieve effective safety alignment at the time of release, they still face various safety challenges. A key issue is that fine-tuning often compromises the safety alignment of LLMs. To address this issue, we propose a method named IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones. We evaluate the effectiveness of IRR across various datasets and under both full fine-tuning and LoRA. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks, while maintaining their performance on downstream tasks. The source code is available at:…
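
The identify/remove/recalibrate pipeline described in the abstract can be illustrated with a toy sketch. The abstract does not say how unsafe delta parameters are identified or how the retained ones are recalibrated, so everything specific here is an assumption: the function name `realign`, the caller-supplied `unsafe_score`, the `remove_ratio` cutoff, and the simple rescaling used as a stand-in for recalibration are all hypothetical and are not taken from the paper or its repository. Weights are flat Python lists to keep the example self-contained.

```python
from typing import List, Sequence


def realign(
    base: Sequence[float],
    tuned: Sequence[float],
    unsafe_score: Sequence[float],
    remove_ratio: float = 0.25,
) -> List[float]:
    """Toy identify/remove/recalibrate pass over a flat weight vector.

    `unsafe_score` and `remove_ratio` are hypothetical inputs; the paper's
    actual identification criterion is not given in the abstract.
    """
    if not (len(base) == len(tuned) == len(unsafe_score)):
        raise ValueError("all inputs must have the same length")

    # Delta parameters: what fine-tuning changed relative to the base model.
    delta = [t - b for b, t in zip(base, tuned)]

    # Identify: flag the highest-scoring fraction of deltas as unsafe.
    n_remove = int(len(delta) * remove_ratio)
    ranked = sorted(range(len(delta)), key=lambda i: unsafe_score[i], reverse=True)
    unsafe = set(ranked[:n_remove])

    # Remove: zero the flagged deltas so those weights revert to the base model.
    kept = [0.0 if i in unsafe else d for i, d in enumerate(delta)]

    # Recalibrate (placeholder, NOT the paper's procedure): rescale the
    # surviving deltas so their total absolute magnitude matches the original.
    total = sum(abs(d) for d in delta)
    remaining = sum(abs(d) for d in kept)
    scale = total / remaining if remaining > 0 else 1.0

    return [b + scale * d for b, d in zip(base, kept)]


if __name__ == "__main__":
    base = [1.0, 1.0, 1.0, 1.0]
    tuned = [2.0, 5.0, 3.0, 2.0]
    scores = [0.1, 0.9, 0.2, 0.3]  # hypothetical per-parameter unsafe scores
    print(realign(base, tuned, scores))
```

In this sketch the flagged weight falls back to its base-model value while the remaining deltas are scaled up. A real implementation would operate on model tensors (or LoRA adapter weights) and would replace both the score and the rescaling step with the paper's own procedures.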

Code & Models

Repositories

pikepokenew/IRR · PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling