Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models

Sanghyun Kim; Moonseok Choi; Jinwoo Shin; Juho Lee

arXiv:2412.00357·cs.AI·December 3, 2024

Open Access

TL;DR

This paper shows that safety alignment in text-to-image diffusion models can break down during fine-tuning, allowing previously suppressed harmful content to re-emerge even when the fine-tuning data is benign, and proposes Modular LoRA to preserve safety without sacrificing performance.

Contribution

The paper identifies the breakdown of safety alignment during fine-tuning and introduces Modular LoRA, a method that prevents diffusion models from relearning harmful content.

Findings

01. Modular LoRA effectively prevents re-learning of harmful content.

02. Traditional fine-tuning can undo safety measures, causing harmful concepts to reappear.

03. Modular LoRA maintains safety without reducing model performance.

Abstract

Fine-tuning text-to-image diffusion models is widely used for personalization and adaptation for new domains. In this paper, we identify a critical vulnerability of fine-tuning: safety alignment methods designed to filter harmful content (e.g., nudity) can break down during fine-tuning, allowing previously suppressed content to resurface, even when using benign datasets. While this "fine-tuning jailbreaking" issue is known in large language models, it remains largely unexplored in text-to-image diffusion models. Our investigation reveals that standard fine-tuning can inadvertently undo safety measures, causing models to relearn harmful concepts that were previously removed and even exacerbate harmful behaviors. To address this issue, we present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation (LoRA) modules separately from…
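
The abstract is truncated here, so how the Safety LoRA modules are trained and combined is not stated. The sketch below only illustrates the general low-rank adaptation idea the name refers to: a frozen weight matrix plus separate low-rank updates kept as independent modules. Everything in it (`make_lora`, `lora_delta`, `effective_weight`, the dimensions, and the additive combination rule) is a hypothetical illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2  # illustrative sizes, not from the paper

# Frozen base weight of one linear layer (stand-in for a model projection).
W = rng.normal(size=(d_out, d_in))


def make_lora(rank, d_out, d_in, rng):
    """Return a (B, A) pair; the low-rank update it represents is B @ A."""
    B = rng.normal(size=(d_out, rank)) * 0.1
    A = rng.normal(size=(rank, d_in)) * 0.1
    return B, A


def lora_delta(lora, scale=1.0):
    """Dense weight update for one LoRA module."""
    B, A = lora
    return scale * (B @ A)


def effective_weight(W, loras):
    """Base weight plus the sum of the given LoRA updates (assumed rule)."""
    return W + sum(lora_delta(lora) for lora in loras)


# Two separately held modules; random values stand in for trained weights.
safety_lora = make_lora(rank, d_out, d_in, rng)  # hypothetical safety module
task_lora = make_lora(rank, d_out, d_in, rng)    # hypothetical fine-tuning module

W_safe_only = effective_weight(W, [safety_lora])
W_both = effective_weight(W, [safety_lora, task_lora])
```

Because each module is stored separately from the base weight, one can be added or dropped without modifying the other. Whether this matches the way the paper combines its modules cannot be confirmed from the truncated abstract.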

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Diffusion