Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

ShengYun Peng; Pin-Yu Chen; Matthew Hull; Duen Horng Chau

arXiv:2405.17374·cs.LG·November 1, 2024·2 cites

TL;DR

This paper introduces the concept of a "safety basin" in LLMs: a local region of parameter space in which random weight perturbations preserve the safety level of the aligned model. It also proposes a new safety metric, VISAGE, to evaluate risks during finetuning.

Contribution

The study uncovers the universal safety basin phenomenon in open-source LLMs and develops the VISAGE safety metric to measure safety landscape shifts during finetuning.

Findings

01. Safety is preserved within a local parameter region called the safety basin.

02. Outside the safety basin, safety drops sharply, indicating risk of unsafe behavior.

03. System prompts play a crucial role in maintaining safety within the basin.

Abstract

Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference. However, recent studies show that the alignment can be easily compromised through finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed as "safety basin": random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. However, outside this local region, safety is fully compromised, exhibiting a sharp, step-like drop. This safety basin contrasts sharply with the LLM capability landscape, where model performance peaks at the origin and gradually…
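The perturbation idea in the abstract can be illustrated with a small one-dimensional scan: move the aligned weights along a random unit direction by a step size and re-score safety at each step. The sketch below is a toy under stated assumptions, not the authors' implementation. The function names (`random_unit_direction`, `perturb`, `scan_1d`), the four-element weight vector, and especially `toy_safety_score` with its `radius` parameter are all hypothetical; the toy score simply hard-codes a step-like drop so the shape of a basin is visible, whereas a real study would evaluate an actual model on safety prompts.

```python
import math
import random


def random_unit_direction(dim, seed=0):
    """Sample a Gaussian direction and scale it to unit L2 norm."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def perturb(weights, direction, alpha):
    """Return weights + alpha * direction without mutating the input."""
    return [w + alpha * d for w, d in zip(weights, direction)]


def toy_safety_score(weights, aligned_weights, radius=0.5):
    """HYPOTHETICAL stand-in for a real safety evaluation.

    Returns 1.0 (safe) while the weights stay within `radius` of the
    aligned weights and 0.0 otherwise. The step shape is assumed here
    purely for illustration; it is not measured from any model.
    """
    dist = math.sqrt(
        sum((w - a) ** 2 for w, a in zip(weights, aligned_weights))
    )
    return 1.0 if dist < radius else 0.0


def scan_1d(aligned_weights, direction, alphas, score_fn):
    """Score the perturbed weights at every step size in `alphas`."""
    return [
        (alpha, score_fn(perturb(aligned_weights, direction, alpha),
                         aligned_weights))
        for alpha in alphas
    ]


# Toy "aligned model": a flat list of four weights (assumed values).
aligned = [0.1, -0.3, 0.7, 0.2]
direction = random_unit_direction(len(aligned), seed=0)
alphas = [-1.0, -0.25, 0.0, 0.25, 1.0]

profile = scan_1d(aligned, direction, alphas, toy_safety_score)
for alpha, score in profile:
    print(f"alpha={alpha:+.2f}  safety={score:.1f}")
```

Because the direction has unit norm, the step size equals the distance moved from the aligned weights, which makes the width of the flat region easy to read off the scan. Swapping `toy_safety_score` for a real evaluation is the part that this sketch deliberately leaves open.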

Code & Models

Repositories

shengyun-peng/llm-landscape (PyTorch, official)

Videos

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models (SlidesLive)

Taxonomy

Topics: Topic Modeling

Methods: ALIGN