Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency
Kathleen C. Fraser, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko

TL;DR
Fine-tuning large language models can unintentionally weaken their safety alignment, and safety evaluation results can shift with small changes in experimental setup, highlighting the need for more reliable safety assessment methods.
Contribution
This paper shows that safety evaluation outcomes vary with minor procedural changes and with stochastic factors in LLM fine-tuning.
Findings
Safety benchmark results vary significantly with trivial setup changes
Fine-tuning can diminish safety alignment even without harmful data
Evaluation reproducibility is a major concern in LLM safety assessment
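One way to make the reproducibility concern concrete is to report the spread of a safety metric over repeated fine-tuning and evaluation runs rather than a single number. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `summarize_runs`, the choice of "harmful response rate" as the metric, and the example values are all hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(harmful_rates):
    """Summarize a safety metric measured over repeated runs.

    harmful_rates: one value per run (for example, one per random seed),
    each the fraction of benchmark prompts that drew a harmful response.
    This metric definition is an assumption made for illustration.
    """
    if len(harmful_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "mean": mean(harmful_rates),
        "stdev": stdev(harmful_rates),  # sample standard deviation
        "range": max(harmful_rates) - min(harmful_rates),
    }


# Hypothetical placeholder values, NOT results from the paper.
runs = [0.10, 0.20, 0.30]
summary = summarize_runs(runs)
print(summary)
```

Reporting the standard deviation and range alongside the mean makes it visible when a difference between two models or setups is smaller than the run-to-run noise.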
Abstract
Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. However, fine-tuning is known to remove the safety alignment features of the model, even when the fine-tuning data does not contain any harmful content. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the "attack". Most well-intentioned developers are likely unaware that they are deploying an LLM with reduced safety. On the other hand, this known vulnerability can be easily exploited by malicious actors intending to bypass safety guardrails. To make any meaningful progress in mitigating this issue, we first need reliable and reproducible safety evaluations. In this work, we investigate how robust a safety benchmark is to trivial variations in the experimental…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Natural Language Processing Techniques
