Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles

Zhilong Wang; Haizhou Wang; Nanqing Luo; Lan Zhang; Xiaoyan Sun; Yebo Cao; Peng Liu

arXiv:2408.11182 · cs.CR · February 10, 2025

TL;DR

This paper introduces a blackbox jailbreak method for large language models that embeds a prohibited query in a benign carrier article, bypassing safety safeguards with an average success rate of 63% across models.

Contribution

The paper proposes an attack technique that draws on insights from the self-attention computation and uses carrier articles to raise jailbreak success rates against LLMs.

Findings

01 Achieved an average success rate of 63% across models.

02 Outperformed existing blackbox jailbreak methods.

03 Effective in bypassing safety safeguards.

Abstract

Large Language Model (LLM) jailbreak refers to a type of attack that aims to bypass the safeguards of an LLM so that it generates content inconsistent with the safe usage guidelines. Based on insights from the self-attention computation process, this paper proposes a novel blackbox jailbreak approach, which crafts the payload prompt by strategically injecting the prohibited query into a carrier article. The carrier article, which maintains semantic proximity to the prohibited query, is automatically produced by combining a hypernymy article and a context, both of which are generated from the prohibited query. The intuition behind using a carrier article is to activate the neurons in the model related to the semantics of the prohibited query while suppressing the neurons that would trigger the objectionable text. The carrier article itself is benign, and we leveraged prompt…

Taxonomy

Topics: Digital and Cyber Forensics