SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation
Jiadong Pan, Liang Li, Hongcheng Gao, Zheng-Jun Zha, Qingming Huang, Jiebo Luo

TL;DR
SafeCFG is a novel method that adaptively controls harmful features during diffusion model image generation, balancing safety and quality without requiring pre-labeled data.
Contribution
It introduces a dynamic safe guidance mechanism for diffusion models that enhances safety while maintaining high image quality, and enables unsupervised harmfulness detection.
Findings
SafeCFG achieves high-quality, safe image generation.
It enables unsupervised harmfulness detection.
SafeCFG maintains performance without pre-labeled data.
Abstract
Diffusion models (DMs) have demonstrated exceptional performance in text-to-image tasks, leading to their widespread use. The introduction of classifier-free guidance (CFG) has significantly improved the quality of images generated by DMs. However, CFG can also be misused: by maliciously guiding the image generation process through CFG, one can make DMs generate more harmful images. Existing safe alignment methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we propose SafeCFG, which adaptively controls harmful features with dynamic safe guidance by modulating the CFG generation process. It dynamically guides the CFG generation process based on the harmfulness of the prompts, inducing significant deviations only in harmful CFG generations and thereby achieving both high-quality and safe generation. SafeCFG can simultaneously modulate…
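To make the guidance step mentioned in the abstract concrete, here is a minimal sketch in plain Python. The first function is the standard CFG combination of unconditional and conditional noise predictions. The second function is purely illustrative and is not the paper's method: the names `eps_safe` and `harm_score`, and the linear blending rule, are assumptions introduced here only to show what "deviating more for more harmful prompts" could look like. The excerpt above does not specify SafeCFG's actual mechanism.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Standard classifier-free guidance.

    Returns eps_uncond + scale * (eps_cond - eps_uncond), element-wise.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def harm_scaled_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    eps_safe: Sequence[float],
    scale: float,
    harm_score: float,
) -> List[float]:
    """Hypothetical illustration only (not the SafeCFG algorithm).

    Blends the conditional prediction toward an assumed 'safe' prediction
    in proportion to harm_score in [0, 1], then applies standard CFG.
    With harm_score == 0 the output equals plain CFG, so clean prompts
    are left untouched.
    """
    if not 0.0 <= harm_score <= 1.0:
        raise ValueError("harm_score must lie in [0, 1]")
    eps_target = [
        (1.0 - harm_score) * c + harm_score * s
        for c, s in zip(eps_cond, eps_safe)
    ]
    return cfg_combine(eps_uncond, eps_target, scale)


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    safe = [0.5, 0.5]
    print(cfg_combine(uncond, cond, scale=2.0))
    print(harm_scaled_guidance(uncond, cond, safe, scale=2.0, harm_score=0.5))
```

In this toy version, a prompt scored as harmless follows the ordinary CFG trajectory, while a prompt scored as fully harmful is guided entirely by the assumed safe prediction. How harmfulness is estimated and how the guidance is modulated in SafeCFG itself should be taken from the full paper.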
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques
Methods: Diffusion
