Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau

TL;DR
This paper introduces the concept of a "safety basin" in LLMs: random perturbations to the weights of an aligned model preserve its safety within a local region of parameter space. It also proposes a new safety metric, VISAGE, to evaluate risks during finetuning.
Contribution
The study identifies the safety basin, a phenomenon observed across popular open-source LLMs, and develops the VISAGE safety metric to measure shifts in the safety landscape during finetuning.
Findings
Safety is preserved within a local parameter region called the safety basin.
Outside the safety basin, safety is fully compromised, falling in a sharp, step-like drop.
System prompts play a crucial role in maintaining safety within the basin.
Abstract
Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference. However, recent studies show that the alignment can be easily compromised through finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed as "safety basin": random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. However, outside this local region, safety is fully compromised, exhibiting a sharp, step-like drop. This safety basin contrasts sharply with the LLM capability landscape, where model performance peaks at the origin and gradually…
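The perturbation experiment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_safety` is a hypothetical stand-in for a real safety evaluation, the flat weight list stands in for an LLM's parameters, rescaling the random direction to the norm of the weights is an assumed normalization, and the final average is a plain mean over the sweep rather than the paper's VISAGE definition.

```python
import math
import random


def norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def random_direction(weights, rng):
    """Draw a Gaussian direction and rescale it to the norm of the weights.

    Matching the weight norm is an assumption made for this sketch.
    """
    d = [rng.gauss(0.0, 1.0) for _ in weights]
    scale = norm(weights) / norm(d)
    return [x * scale for x in d]


def sweep(weights, direction, safety_fn, alphas):
    """Evaluate safety_fn at weights + alpha * direction for each alpha."""
    return [
        safety_fn([w + a * d for w, d in zip(weights, direction)])
        for a in alphas
    ]


def toy_safety(perturbed, reference, radius):
    """Hypothetical safety score: 1.0 within `radius` of the reference
    weights, 0.0 outside. A real study would evaluate the perturbed model
    on a safety benchmark instead."""
    diff = [p - r for p, r in zip(perturbed, reference)]
    return 1.0 if norm(diff) <= radius else 0.0


rng = random.Random(0)
theta = [rng.gauss(0.0, 1.0) for _ in range(16)]  # toy "aligned" weights
direction = random_direction(theta, rng)
alphas = [i / 10 for i in range(-10, 11)]         # -1.0, -0.9, ..., 1.0
radius = 0.55 * norm(theta)                       # toy basin radius (assumed)

scores = sweep(theta, direction, lambda w: toy_safety(w, theta, radius), alphas)
mean_safety = sum(scores) / len(scores)           # plain average over the sweep
```

In this toy setup the scores stay at 1.0 around alpha = 0 and drop to 0.0 once the perturbation leaves the assumed radius, which mimics the step-like shape the abstract describes. Repeating the sweep over several random directions would give a less direction-dependent picture.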
Taxonomy
Topics: Topic Modeling
Methods: ALIGN
