AlignGuard: Scalable Safety Alignment for Text-to-Image Generation

Runtao Liu; I Chieh Chen; Jindong Gu; Jipeng Zhang; Renjie Pi; Qifeng Chen; Philip Torr; Ashkan Khakzar; Fabio Pizzati

arXiv:2412.10493 · cs.CV · July 1, 2025

TL;DR

AlignGuard introduces a scalable safety alignment method for text-to-image models using synthetic datasets and expert merging, significantly reducing harmful content generation and outperforming existing safety measures.

Contribution

The paper presents a novel expert-based safety alignment approach for T2I models using DPO and a new merging strategy, enabling removal of more harmful concepts than previous methods.

Findings

01. Removes 7x more harmful concepts than baselines

02. Outperforms state-of-the-art methods on safety benchmarks

03. Scalable safety alignment via expert merging

Abstract

Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce AlignGuard, a method for safety alignment of T2I models. We enable the application of Direct Preference Optimization (DPO) for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy…
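
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for one preference pair, with the safe image treated as the preferred sample and the harmful image as the rejected one. It is not the custom DPO strategy the abstract refers to, whose details are not given here. The function name `dpo_loss`, its parameter names, and the default `beta` are assumptions made for illustration, and the log-likelihood inputs are placeholders (diffusion models typically substitute a denoising-error proxy for them).

```python
import math


def dpo_loss(logp_safe, logp_harmful, ref_logp_safe, ref_logp_harmful, beta=0.1):
    """Standard DPO loss for one (safe, harmful) pair sharing a prompt.

    All names here are illustrative assumptions. logp_* come from the model
    being trained, ref_logp_* from a frozen reference copy of that model.
    """
    # How much more the trained model favours the safe sample over the
    # harmful one, measured relative to the reference model.
    margin = (logp_safe - ref_logp_safe) - (logp_harmful - ref_logp_harmful)
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# The loss is lower when the model shifts probability toward the safe image.
prefers_safe = dpo_loss(-1.0, -3.0, -2.0, -2.0)
prefers_harmful = dpo_loss(-3.0, -1.0, -2.0, -2.0)
print(prefers_safe < prefers_harmful)
```

In a LoRA setup such as the one the abstract describes, only the low-rank adapter weights would receive gradients from this loss while the base model stays frozen; that wiring is omitted from the sketch.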

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques

Methods: Direct Preference Optimization