Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Hua Farn; Hsuan Su; Shachi H Kumar; Saurav Sahay; Shang-Tse Chen; Hung-yi Lee

arXiv:2412.19512·cs.CL·August 29, 2025

TL;DR

This paper proposes merging the weights of the pre- and post-fine-tuned models to mitigate the safety degradation caused by fine-tuning while improving downstream task performance, without requiring additional safety data.

Contribution

The authors show that simply merging the weights of the pre- and post-fine-tuned models mitigates safety degradation and improves downstream performance, with no additional safety dataset required.

Findings

01 Merging the weights of the pre- and post-fine-tuned models mitigates safety degradation.

02 The merged model also improves downstream task performance.

03 Experiments across different downstream tasks and models validate the method's practicality and effectiveness.

Abstract

Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting, notably degrading the safety of originally aligned models. While some existing methods attempt to restore safety by incorporating additional safety data, the quality of such data typically falls short of that used in the original alignment process. Moreover, these high-quality safety datasets are generally inaccessible, making it difficult to fully recover the model's original safety. We ask: How can we preserve safety while improving downstream task performance without additional safety data? We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance. Experiments across different downstream tasks and models validate the method's practicality and effectiveness.
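As a rough illustration of what merging pre- and post-fine-tuned weights can look like, here is a minimal sketch using plain linear interpolation. The abstract does not specify the merging scheme, so the interpolation rule, the function name `merge_weights`, the coefficient `lam`, and the toy parameter names are all assumptions made for illustration, not the authors' implementation. Plain Python lists stand in for model tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, lam: float = 0.5) -> Weights:
    """Linearly interpolate between two sets of weights.

    lam = 0.0 returns the pre-fine-tuned (base) weights,
    lam = 1.0 returns the post-fine-tuned (tuned) weights.
    This interpolation rule is an assumption, not taken from the paper.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Weights = {}
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - lam) * b + lam * t for b, t in zip(base_vals, tuned_vals)
        ]
    return merged


# Toy example with made-up values.
base = {"layer.weight": [0.0, 1.0], "layer.bias": [2.0]}
tuned = {"layer.weight": [1.0, 3.0], "layer.bias": [4.0]}
merged = merge_weights(base, tuned, lam=0.5)
print(merged)
```

In this sketch, `lam` controls how far the merged model moves from the aligned base model toward the fine-tuned one. How such a coefficient would be chosen in practice is not stated in the abstract.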

Taxonomy

Topics: Semiconductor materials and devices