SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Mohammed Talha Alam; Nada Saadi; Fahad Shamshad; Nils Lukas; Karthik Nandakumar; Fahkri Karray; Samuele Poppi

arXiv:2511.19558·cs.CR·November 26, 2025

Open Access

TL;DR

The paper introduces SPQR, a comprehensive benchmark for evaluating the robustness of safety alignment methods in text-to-image diffusion models under benign fine-tuning, highlighting frequent safety breakdowns.

Contribution

The paper presents SPQR, a standardized, single-score benchmark for assessing the safety, utility, and robustness of text-to-image (T2I) models after benign fine-tuning, with extensive multilingual and out-of-distribution analysis.

Findings

01. Safety alignment often fails after benign fine-tuning.
02. SPQR provides a reproducible, comprehensive evaluation framework.
03. Benchmark results reveal specific failure modes across categories.

Abstract

Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. As true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness). SPQR is a single-score metric that provides a standardized and reproducible framework to evaluate how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, by reporting a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning