TL;DR
This paper introduces a dynamic safety shaping framework for finetuning large language models, using fine-grained safety signals to improve safety without sacrificing task performance.
Contribution
It proposes STAR-DSS, a novel method that leverages token-level safety signals from guardrail models to dynamically mitigate safety risks during finetuning.
Findings
Significant safety improvements across multiple datasets and models.
Effective mitigation of finetuning risks without loss of capabilities.
Introduction of STAR, a token-level safety trajectory assessment tool.
Abstract
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal, a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial…
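The contrast between static and dynamic safety shaping can be illustrated with a small sketch. This is not the paper's STAR-DSS objective, which the truncated abstract does not spell out; it is a simplified token-weighting illustration. The function names, the `token_nll` and `token_safety` inputs, and the assumption that a guardrail model yields a per-token safety score in [0, 1] are all hypothetical.

```python
from typing import List


def static_shaped_loss(token_nll: List[float], example_safety: float) -> float:
    """Static shaping (sketch): one safety weight scales the whole example."""
    if not token_nll:
        raise ValueError("token_nll must be non-empty")
    return example_safety * sum(token_nll) / len(token_nll)


def dynamic_shaped_loss(token_nll: List[float], token_safety: List[float]) -> float:
    """Dynamic shaping (sketch): each token's loss is scaled by its own score.

    token_safety[i] is an assumed score in [0, 1]; 1.0 means the response
    prefix ending at token i is judged safe, 0.0 means it is judged unsafe.
    """
    if not token_nll or len(token_nll) != len(token_safety):
        raise ValueError("inputs must be non-empty and of equal length")
    weighted = [s * nll for s, nll in zip(token_safety, token_nll)]
    return sum(weighted) / len(token_nll)


# Hypothetical response: the first three tokens are safe, the last two are not.
token_nll = [0.5, 0.4, 0.6, 2.0, 2.5]       # per-token negative log-likelihoods
token_safety = [1.0, 1.0, 1.0, 0.0, 0.0]    # assumed guardrail scores

# Static shaping sees only an example-level score (here, the mean token score),
# so the unsafe tokens still contribute to the update.
example_safety = sum(token_safety) / len(token_safety)
static_loss = static_shaped_loss(token_nll, example_safety)

# Dynamic shaping removes the unsafe tokens' contribution entirely.
dynamic_loss = dynamic_shaped_loss(token_nll, token_safety)
```

In this toy case the static loss still carries gradient signal from the two unsafe tokens, while the dynamic loss learns only from the safe segment. A real implementation would presumably also need a way to actively suppress unsafe content rather than merely ignore it, which this sketch does not attempt.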