Antelope: Potent and Concealed Jailbreak Attack Strategy
Xin Zhao, Xiaojun Chen, Haoyu Gao

TL;DR
Antelope is a novel, covert jailbreak attack strategy that exploits semantic confusion and transferability to bypass security filters in generative models, effectively generating NSFW content despite safeguards.
Contribution
The paper introduces Antelope, a robust and covert attack method that improves search efficiency and attack stealthiness by leveraging semantic concept confusion and transferability.
Findings
Outperforms existing attack baselines across multiple defenses
Effectively generates NSFW content while evading detection
Successfully penetrates online black-box services
Abstract
Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from low search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, facilitates searches…
Taxonomy
Topics: Cybercrime and Law Enforcement Studies · Digital and Cyber Forensics · Terrorism, Counterterrorism, and Political Violence
