Shape it Up! Restoring LLM Safety during Finetuning

ShengYun Peng; Pin-Yu Chen; Jianfeng Chi; Seongmin Lee; Duen Horng Chau

arXiv:2505.17196 · cs.LG · December 23, 2025

TL;DR

This paper introduces a dynamic safety shaping framework for finetuning large language models, using fine-grained safety signals to improve safety without sacrificing task performance.

Contribution

The paper proposes STAR-DSS, a method that uses token-level safety signals from guardrail models to dynamically mitigate safety risks during finetuning.

Findings

01 Significant safety improvements across multiple datasets and models.

02 Effective mitigation of finetuning risks without loss of capabilities.

03 Introduction of STAR, a token-level safety trajectory assessment tool.

Abstract

Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal, a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial…
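The contrast between static and dynamic safety shaping can be illustrated with a toy loss computation. This is a minimal sketch under stated assumptions, not the paper's STAR-DSS objective: the function names, the per-token loss values, and the per-token safety scores in [0, 1] are all hypothetical, and a real setup would obtain the scores from a guardrail model evaluated on partial responses.

```python
def static_shaped_loss(token_losses, example_is_safe):
    """Static shaping: one weight for the whole example.

    Every token in the response is kept (weight 1) or dropped (weight 0)
    together, regardless of where unsafe content actually appears.
    """
    weight = 1.0 if example_is_safe else 0.0
    return weight * sum(token_losses) / len(token_losses)


def dynamic_shaped_loss(token_losses, safety_scores):
    """Dynamic shaping (illustrative): one weight per token.

    safety_scores[t] is a hypothetical score in [0, 1] for the response
    prefix ending at token t; higher means safer. Tokens in safe segments
    contribute to the loss, tokens in unsafe segments are suppressed.
    """
    if len(token_losses) != len(safety_scores):
        raise ValueError("need one safety score per token")
    weighted = sum(s * l for s, l in zip(safety_scores, token_losses))
    return weighted / len(token_losses)


# Made-up example: a response that starts safe and turns unsafe halfway.
token_losses = [0.5, 0.5, 2.0, 2.0]
safety_scores = [1.0, 1.0, 0.0, 0.0]

static_keep = static_shaped_loss(token_losses, example_is_safe=True)
static_drop = static_shaped_loss(token_losses, example_is_safe=False)
dynamic = dynamic_shaped_loss(token_losses, safety_scores)
print(static_keep, static_drop, dynamic)
```

In this toy case, static shaping must either learn from the unsafe second half or discard the safe first half, while the per-token weighting keeps only the safe segment's contribution.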

Videos

Shape it Up! Restoring LLM Safety during Finetuning · SlidesLive