Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Lei Hsiung; Tianyu Pang; Yung-Chen Tang; Linyue Song; Tsung-Yi Ho; Pin-Yu Chen; Yaoqing Yang

arXiv:2506.05346·cs.CR·June 6, 2025

Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang

PDF

Open Access

TL;DR

This paper reveals that high similarity between safety-alignment data and fine-tuning datasets weakens LLM safety guardrails, increasing jailbreak risks, and suggests designing less similar datasets for more robust models.

Contribution

It introduces a novel analysis of dataset similarity's impact on LLM safety, emphasizing upstream dataset design to enhance guardrail robustness.

Findings

01

High similarity weakens safety guardrails

02

Low similarity improves model robustness

03

Reduces harmfulness score by up to 10.33%

Abstract

Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Information and Cyber Security · Topic Modeling

Methodstravel james · Focus