Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim; Jin Myung Kwak; Lama Alssum; Bernard Ghanem; Philip Torr; David Krueger; Fazl Barez; Adel Bibi

arXiv:2508.12531·cs.LG·August 19, 2025

Open Access

TL;DR

This paper shows that safety degradation when fine-tuning language models often stems from poor optimization choices rather than an inherent trade-off, and proposes careful hyper-parameter selection and an exponential moving average (EMA) technique in parameter space to preserve safety without sacrificing utility.

Contribution

The paper demonstrates that safety degradation during fine-tuning can be mitigated by properly selecting optimization hyper-parameters (e.g., learning rate, batch size, and gradient steps), and introduces an EMA method in parameter space to preserve safety properties.

Findings

01

Proper hyper-parameter selection reduces unsafe responses from 16% to approximately 5%, as measured by keyword matching.

02

The proposed EMA technique preserves safety performance while maintaining model utility.

03

Safety issues are largely due to optimization, not inherent trade-offs.
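The paper's abstract states that unsafe responses are measured by keyword matching. A minimal sketch of such a metric is given here, assuming that a response counts as safe when it contains a refusal phrase. The keyword list and the function names `is_refusal` and `unsafe_rate` are hypothetical and chosen for illustration; the paper's actual keyword list and evaluation code are not given in this summary.

```python
from typing import Iterable, Sequence

# Hypothetical refusal markers, for illustration only; the paper's
# actual keyword list is not reproduced in this summary.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i'm sorry", "i am sorry", "as an ai")


def is_refusal(response: str, keywords: Sequence[str] = REFUSAL_KEYWORDS) -> bool:
    """Return True if the response contains any refusal keyword (case-insensitive)."""
    text = response.lower()
    return any(keyword in text for keyword in keywords)


def unsafe_rate(responses: Iterable[str],
                keywords: Sequence[str] = REFUSAL_KEYWORDS) -> float:
    """Fraction of responses to harmful prompts that contain no refusal keyword."""
    collected = list(responses)
    if not collected:
        raise ValueError("need at least one response")
    unsafe = sum(1 for r in collected if not is_refusal(r, keywords))
    return unsafe / len(collected)


if __name__ == "__main__":
    sample = [
        "I'm sorry, but I cannot help with that.",
        "Sure, here is how to do it.",
        "I cannot assist with this request.",
        "Step 1: gather the materials.",
    ]
    # Two of the four sample responses contain no refusal keyword.
    print(f"unsafe rate: {unsafe_rate(sample):.0%}")
```

Keyword matching is cheap but coarse: a response that refuses politely without any listed phrase would be counted as unsafe, and a harmful response that happens to include an apology would be counted as safe.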

Abstract

Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., refusing to respond to harmful user requests, even when using harmless datasets, thus requiring additional safety measures. We challenge this belief through systematic testing, showing that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. By properly selecting key training hyper-parameters, e.g., learning rate, batch size, and gradient steps, we reduce unsafe model responses from 16% to approximately 5%, as measured by keyword matching, while maintaining utility performance. Based on this observation, we propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance by creating a stable optimization path and retains the original pre-trained model's…
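To illustrate what an exponential moving average in parameter space looks like, here is a minimal sketch of the textbook update `ema <- decay * ema + (1 - decay) * current`, with the average initialized from the pre-trained weights. This is an assumption-laden illustration, not the authors' implementation: the function names, the dictionary-of-lists parameter format, the plain gradient step, and the default `decay` and `learning_rate` values are all hypothetical, and the paper's exact formulation may differ.

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def ema_update(ema_params: Params, current_params: Params, decay: float = 0.99) -> Params:
    """Generic EMA step: ema <- decay * ema + (1 - decay) * current."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: [decay * e + (1.0 - decay) * c
               for e, c in zip(ema_params[name], current_params[name])]
        for name in ema_params
    }


def fine_tune_with_ema(pretrained: Params,
                       gradients_per_step: Iterable[Params],
                       learning_rate: float = 0.1,
                       decay: float = 0.99) -> Params:
    """Toy fine-tuning loop: plain gradient steps plus an EMA of the weights.

    The EMA starts at the pre-trained weights (an assumption of this sketch),
    so a decay close to 1 keeps the averaged model close to that start.
    """
    current = {name: list(values) for name, values in pretrained.items()}
    ema = {name: list(values) for name, values in pretrained.items()}
    for grads in gradients_per_step:
        # Hypothetical optimizer: vanilla gradient descent on each parameter.
        current = {
            name: [w - learning_rate * g for w, g in zip(current[name], grads[name])]
            for name in current
        }
        ema = ema_update(ema, current, decay)
    return ema


if __name__ == "__main__":
    pretrained = {"w": [1.0, -2.0]}
    made_up_grads = [{"w": [0.5, -0.5]} for _ in range(3)]  # illustrative values only
    print(fine_tune_with_ema(pretrained, made_up_grads))
```

In this sketch, `decay` controls how slowly the averaged weights move away from their starting point: with `decay=1.0` the averaged model never leaves the pre-trained weights, and with `decay=0.0` it simply tracks the latest fine-tuned weights.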

Taxonomy

Topics: VLSI and Analog Circuit Testing · Particle accelerators and beam dynamics · Advancements in Photolithography Techniques