MacPrompt: Macaronic-guided Jailbreak against Text-to-Image Models
Xi Ye, Yiwen Liu, Lina Wang, Run Wang, Geying Yang, Yufei Hou, Jiayi Yu

TL;DR
MacPrompt introduces a novel cross-lingual, macaronic adversarial prompt technique that exposes vulnerabilities in text-to-image safety filters, achieving high success rates in bypassing defenses and highlighting the need for more robust safety mechanisms.
Contribution
The paper presents MacPrompt, a new black-box, cross-lingual attack method that constructs fine-grained, semantically similar adversarial prompts to effectively bypass T2I safety filters.
Findings
Achieves up to 92% success in bypassing safety filters for harmful content.
Constructs prompts with up to 0.96 semantic similarity to original harmful inputs.
Breaks state-of-the-art concept removal defenses with high success rates.
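The two headline numbers above are a rate and a similarity score. This summary does not say how the paper computes either one, so the sketch below is only an assumption for illustration: it treats the bypass rate as the fraction of evaluated prompts that a filter fails to block, and semantic similarity as cosine similarity between two embedding vectors. The function names `bypass_rate` and `cosine_similarity` are hypothetical, and the embedding vectors are assumed to come from some text encoder that is not specified here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: the paper's "semantic similarity" is some embedding-based
    score; cosine similarity is used here only as a common stand-in.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def bypass_rate(passed_filter: Sequence[bool]) -> float:
    """Fraction of evaluated prompts that the filter did not block.

    Assumption: each entry records whether one test prompt got past the
    filter under evaluation (True) or was blocked (False).
    """
    if not passed_filter:
        raise ValueError("need at least one evaluated prompt")
    return sum(1 for p in passed_filter if p) / len(passed_filter)
```

For anyone evaluating a safety filter, a bypass rate near 0 on a held-out adversarial set is the goal. A high similarity score on prompts that still get through suggests the filter is matching surface strings rather than meaning.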
Abstract
Text-to-image (T2I) models have raised increasing safety concerns due to their capacity to generate NSFW content and other banned objects. To mitigate these risks, safety filters and concept removal techniques have been introduced to block inappropriate prompts or erase sensitive concepts from the models. However, existing defense methods are not well prepared to handle diverse adversarial prompts. In this work, we introduce MacPrompt, a novel black-box and cross-lingual attack that reveals previously overlooked vulnerabilities in T2I safety mechanisms. Unlike existing attacks that rely on synonym substitution or prompt obfuscation, MacPrompt constructs macaronic adversarial prompts by performing cross-lingual character-level recombination of harmful terms, enabling fine-grained control over both semantics and appearance. By leveraging this design, MacPrompt crafts prompts with high…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Generative Adversarial Networks and Image Synthesis
