Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency

Kathleen C. Fraser; Hillary Dawkins; Isar Nejadgholi; Svetlana Kiritchenko

arXiv:2506.17209·cs.CL·June 23, 2025

TL;DR

Fine-tuning large language models can unintentionally reduce safety features and cause inconsistent evaluation results, highlighting the need for more reliable safety assessment methods.

Contribution

This paper reveals the variability in safety evaluation outcomes due to minor procedural changes and stochastic factors in LLM fine-tuning.

Findings

1. Safety benchmark results vary significantly with trivial setup changes
2. Fine-tuning can diminish safety alignment even without harmful data
3. Evaluation reproducibility is a major concern in LLM safety assessment
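
To make the reproducibility concern concrete, the following is a minimal sketch of how one might quantify run-to-run spread in a safety score. It is not the authors' evaluation code: the function names (`harmful_rate`, `summarize_runs`) and the toy judgments are hypothetical, and it assumes each evaluation run yields one boolean "judged harmful" label per response.

```python
from statistics import mean, stdev


def harmful_rate(judgments):
    """Fraction of responses judged harmful (True) in one evaluation run."""
    if not judgments:
        raise ValueError("need at least one judged response")
    return sum(judgments) / len(judgments)


def summarize_runs(runs):
    """Summarize the spread of harmful-response rates across repeated runs."""
    rates = [harmful_rate(run) for run in runs]
    return {
        "rates": rates,
        "mean": mean(rates),
        # Sample standard deviation needs at least two runs.
        "stdev": stdev(rates) if len(rates) > 1 else 0.0,
        "range": max(rates) - min(rates),
    }


# Made-up judgments for three hypothetical runs that differ only in a
# nominally trivial detail (for example, the random seed). Not real data.
runs = [
    [True, False, False, False],
    [True, True, False, False],
    [False, False, False, False],
]
summary = summarize_runs(runs)
print(summary)
```

Reporting the standard deviation and range alongside the mean, rather than a single-run score, is one simple way to show how much a benchmark result depends on the setup.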

Abstract

Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. However, fine-tuning is known to remove the safety alignment features of the model, even when the fine-tuning data does not contain any harmful content. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the "attack". Most well-intentioned developers are likely unaware that they are deploying an LLM with reduced safety. On the other hand, this known vulnerability can be easily exploited by malicious actors intending to bypass safety guardrails. To make any meaningful progress in mitigating this issue, we first need reliable and reproducible safety evaluations. In this work, we investigate how robust a safety benchmark is to trivial variations in the experimental…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Natural Language Processing Techniques