TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
Longtian Wang, Xiaofei Xie, Tianlin Li, Yuhan Zhi, Chao Shen

TL;DR
TokenProber analyzes how individual words in a prompt affect safety checkers and image generation, and uses that analysis to generate adversarial prompts that test and expose vulnerabilities in the safety mechanisms of text-to-image models.
Contribution
The paper introduces a sensitivity-aware differential testing approach that identifies how specific words influence safety checkers and T2I models, revealing robustness weaknesses in the checkers.
Findings
TokenProber achieves over 54% improvement in bypass success rate.
It uncovers significant robustness issues in existing safety checkers.
The method effectively balances NSFW content preservation with evasion of safety filters.
Abstract
Text-to-image (T2I) models have significantly advanced in producing high-quality images. However, such models have the ability to generate images containing not-safe-for-work (NSFW) content, such as pornography, violence, political content, and discrimination. To mitigate the risk of generating NSFW content, refusal mechanisms, i.e., safety checkers, have been developed to check potential NSFW content. Adversarial prompting techniques have been developed to evaluate the robustness of the refusal mechanisms. The key challenge remains to subtly modify the prompt in a way that preserves its sensitive nature while bypassing the refusal mechanisms. In this paper, we introduce TokenProber, a method designed for sensitivity-aware differential testing, aimed at evaluating the robustness of the refusal mechanisms in T2I models by generating adversarial prompts. Our approach is based on the key…
Taxonomy
Topics: Digital Media Forensic Detection · Digital and Cyber Forensics · Law in Society and Culture
