Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles
Zhilong Wang, Haizhou Wang, Nanqing Luo, Lan Zhang, Xiaoyan Sun, Yebo Cao, Peng Liu

TL;DR
This paper introduces a novel blackbox jailbreak method for large language models that embeds prohibited queries in carrier articles, bypassing model safeguards with an average success rate of 63%.
Contribution
It proposes a new attack technique that draws on insights from the self-attention computation and uses carrier articles to raise jailbreak success rates against LLMs.
Findings
Achieved an average success rate of 63% across models.
Outperformed existing blackbox jailbreak methods.
Effective at bypassing model safeguards.
Abstract
Large Language Model (LLM) jailbreak refers to a type of attack that aims to bypass the safeguards of an LLM so that it generates content inconsistent with its safe usage guidelines. Based on insights from the self-attention computation process, this paper proposes a novel blackbox jailbreak approach, which involves crafting the payload prompt by strategically injecting the prohibited query into a carrier article. The carrier article, which maintains semantic proximity to the prohibited query, is automatically produced by combining a hypernymy article and a context, both of which are generated from the prohibited query. The intuition behind using a carrier article is to activate the neurons in the model related to the semantics of the prohibited query while suppressing the neurons that would trigger the objectionable text. The carrier article itself is benign, and we leveraged prompt…
Taxonomy
Topics: Digital and Cyber Forensics
