TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets
Weixin Chen, Dawn Song, Bo Li

TL;DR
This paper introduces TrojDiff, a novel Trojan attack method on diffusion models that manipulates training to produce targeted malicious outputs, demonstrating high attack success while maintaining model performance in benign settings.
Contribution
The paper proposes TrojDiff, a new Trojan attack framework for diffusion models that optimizes the Trojan diffusion and generative processes during training.
Findings
TrojDiff achieves high attack success rates across different targets and triggers.
The attack preserves the performance of diffusion models in benign environments.
TrojDiff is effective on CIFAR-10 and CelebA datasets against multiple diffusion model types.
Abstract
Diffusion models have achieved great success in a range of tasks, such as image synthesis and molecule design. As such successes hinge on large-scale training data collected from diverse sources, the trustworthiness of these collected data is hard to control or audit. In this work, we aim to explore the vulnerabilities of diffusion models under potential training data manipulations and try to answer: How hard is it to perform Trojan attacks on well-trained diffusion models? What are the adversarial targets that such Trojan attacks can achieve? To answer these questions, we propose an effective Trojan attack against diffusion models, TrojDiff, which optimizes the Trojan diffusion and generative processes during training. In particular, we design novel transitions during the Trojan diffusion process to diffuse adversarial targets into a biased Gaussian distribution and propose a new…
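The abstract's phrase "diffuse adversarial targets into a biased Gaussian distribution" can be illustrated with a small sketch. This is only one possible reading, not the paper's confirmed formulation: the closed form used here (a standard DDPM-style forward step whose noise is shifted by a mean `mu` and scaled by `gamma`), the linear noise schedule, and all names (`alpha_bar`, `biased_forward_sample`, `mu`, `gamma`) are assumptions introduced for illustration.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t.

    Assumes a linear beta schedule; the schedule is an illustrative choice.
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def biased_forward_sample(x0, t, mu, gamma, rng):
    """Draw x_t from an assumed biased forward process:

        x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * (mu + gamma * eps),

    with eps ~ N(0, 1) per element. As t grows, x_t approaches a Gaussian
    centred at mu (the hypothetical trigger offset) with std gamma,
    instead of the usual N(0, 1).
    """
    ab = alpha_bar(t)
    scale = math.sqrt(1.0 - ab)
    return [
        math.sqrt(ab) * x + scale * (m + gamma * rng.gauss(0.0, 1.0))
        for x, m in zip(x0, mu)
    ]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]   # toy "target image" with four pixels
mu = [0.3, 0.3, 0.3, 0.3]     # hypothetical trigger-induced bias
x_T = biased_forward_sample(x0, 1000, mu, gamma=0.6, rng=rng)
```

At `t = 0` the sample equals `x0`, and at the final step almost nothing of `x0` remains, so the sample is dominated by the shifted noise. In this sketch, that shift is what distinguishes the Trojan process from a benign one that ends at zero-mean noise.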
Taxonomy
Topics: Machine Learning in Materials Science · Adversarial Robustness in Machine Learning
Methods: Diffusion
