When There Is No Decoder: Removing Watermarks from Stable Diffusion Models in a No-box Setting
Xiaodong Wu, Tianyi Tang, Xiangman Li, Jianbing Ni, and Yong Yu

TL;DR
This paper evaluates the robustness of model-specific watermarks in text-to-image models such as latent diffusion models. It shows that these watermarks are vulnerable to blurring and fine-tuning attacks in a no-box setting, and it assesses existing defenses.
Contribution
The paper introduces three attack strategies for the no-box setting (edge prediction-based, box blurring, and fine-tuning-based), analyzes the factors that affect attack success, and highlights the limitations of current defenses.
Findings
Watermarks are vulnerable to blurring and fine-tuning attacks.
The best attack reduces watermark detection accuracy to about 48%.
Defenses intended to improve robustness, such as multi-label smoothing, remain ineffective under these attacks.
Abstract
Watermarking has emerged as a promising solution to counter harmful or deceptive AI-generated content by embedding hidden identifiers that trace content origins. However, the robustness of current watermarking techniques is still largely unexplored, raising critical questions about their effectiveness against adversarial attacks. To address this gap, we examine the robustness of model-specific watermarking, where watermark embedding is integrated with text-to-image generation in models like latent diffusion models. We introduce three attack strategies: edge prediction-based, box blurring, and fine-tuning-based attacks in a no-box setting, where an attacker does not require access to the ground-truth watermark decoder. Our findings reveal that while model-specific watermarking is resilient against basic evasion attempts, such as edge prediction, it is notably vulnerable to blurring and…
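To make the box blurring attack concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names `box_blur` and `bit_accuracy`, the edge-padding choice, and the default kernel size are all assumptions made for illustration, and the summary above does not state which kernel sizes or decoder the paper uses. The sketch shows only the general idea: the attacker averages each pixel with its neighbours without ever querying the watermark decoder, and the defender later measures how many message bits survive.

```python
import numpy as np


def box_blur(image: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Mean-filter an H x W x C image with a k x k box kernel.

    Hypothetical helper for illustration; borders use edge padding.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    pad = kernel_size // 2
    # Replicate border pixels so the output keeps the input's height and width.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros_like(image, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then divide to get the mean.
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out / (kernel_size ** 2)


def bit_accuracy(decoded_bits, true_bits) -> float:
    """Fraction of watermark message bits recovered correctly (assumed metric)."""
    decoded = np.asarray(decoded_bits)
    true = np.asarray(true_bits)
    return float(np.mean(decoded == true))
```

If accuracy is measured per bit on a binary message, random guessing yields about 50% on average, so a reported accuracy of about 48% would mean the decoder recovers essentially no information from the attacked images. Whether the paper's detection accuracy is computed exactly this way is an assumption here.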
