Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
Vitali Petsiuk, Kate Saenko

TL;DR
This paper reveals how adversaries can bypass safety measures in diffusion models by using concept arithmetics and compositional inference, highlighting vulnerabilities in current safety mechanisms.
Contribution
It introduces a novel attack method leveraging concept arithmetics to reconstruct sensitive concepts, exposing potential safety flaws in diffusion models.
Findings
Provides theoretical and empirical evidence that concept arithmetics attacks are feasible.
Demonstrates how multiple prompts can be combined to reconstruct target concepts.
Discusses implications for designing safer diffusion models.
Abstract
Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use the compositional property of diffusion models, which makes it possible to leverage multiple prompts in a single image generation. This property allows us to combine other concepts that should not have been affected by the inhibition to reconstruct the vector responsible for target concept generation, even though the direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence of why the proposed attacks are possible and discuss the implications…
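The idea of recombining unaffected concepts can be illustrated with a toy numerical sketch. It is not the authors' implementation: the linear "noise predictor" `toy_eps`, the blocking rule in `inhibited_eps`, the helper `compose_noise`, and the relation `target = a - b + c` are all assumptions chosen so that the arithmetic is exact. Real diffusion models are nonlinear, so any such reconstruction would be approximate at best.

```python
import numpy as np

def compose_noise(eps_uncond, eps_conds, weights):
    """Combine several conditional predictions around the unconditional one:
    eps_uncond + sum_i w_i * (eps_i - eps_uncond)."""
    out = eps_uncond.copy()
    for eps_c, w in zip(eps_conds, weights):
        out += w * (eps_c - eps_uncond)
    return out

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(size=(dim, dim))   # hypothetical linear stand-in for a denoiser
bias = rng.normal(size=dim)

def toy_eps(concept_vec):
    """Toy noise prediction for a concept embedding (assumption: linear)."""
    return W @ concept_vec + bias

def inhibited_eps(concept_vec, blocked_vec):
    """Toy inhibition: the blocked concept falls back to the unconditional output."""
    if np.allclose(concept_vec, blocked_vec):
        return toy_eps(np.zeros(dim))
    return toy_eps(concept_vec)

# Assumption: the target concept is expressible as a - b + c in embedding space.
a, b, c = rng.normal(size=(3, dim))
target = a - b + c

eps_uncond = toy_eps(np.zeros(dim))

# Asking for the target directly yields only the unconditional prediction.
direct = inhibited_eps(target, target)

# Combining the three unblocked concepts with weights +1, -1, +1.
composed = compose_noise(
    eps_uncond,
    [inhibited_eps(v, target) for v in (a, b, c)],
    [1.0, -1.0, 1.0],
)

# What the toy model would have predicted for the target without inhibition.
reference = toy_eps(target)
```

In this toy setting `composed` matches `reference` while `direct` does not, which mirrors the abstract's claim that the target vector can be reconstructed even though its direct computation is no longer accessible.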
Taxonomy
Topics: Bayesian Modeling and Causal Inference
Methods: Diffusion
