PLA: Prompt Learning Attack against Text-to-Image Generative Models
Xinqi Lyu, Yihao Liu, Yanjie Li, and Bin Xiao

TL;DR
This paper introduces PLA, a novel prompt learning attack framework that effectively bypasses safety mechanisms in black-box text-to-image models using gradient-based training with multimodal similarities.
Contribution
The paper presents a new prompt learning attack method tailored for black-box T2I models, overcoming limitations of previous word substitution approaches.
Findings
PLA achieves higher attack success rates than existing methods.
The framework effectively bypasses prompt filters and safety checkers.
Gradient-based training with multimodal similarities enhances attack performance.
Abstract
Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite this success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks that bypass their safety mechanisms under black-box settings. Most previous methods rely on word substitution to search for adversarial prompts. Because the search space is limited, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based…
