Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong

TL;DR
This paper reveals that re-weighting MLP neurons in instruction-tuned LLMs can undermine safety, leading to new white-box jailbreak methods that effectively bypass safety measures across various models.
Contribution
We introduce two novel white-box jailbreak techniques targeting MLP layers, enhancing understanding of LLM safety vulnerabilities and internal mechanisms.
Findings
Re-weighting MLP neurons compromises model safety.
Proposed jailbreak methods outperform existing techniques.
Vulnerabilities are consistent across models ranging from 2B to 72B parameters.
Abstract
In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and that MLP layers play a critical role in this process. Based on this hypothesis, we develop two novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across 7 popular open-source LLMs, ranging in size from 2B to 72B parameters. Furthermore, our study provides insights into vulnerabilities of…
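As a rough illustration of what "re-weighting MLP neurons" can mean, the toy sketch below multiplies each hidden activation of a small two-layer MLP by a per-neuron factor before the output projection. It is an assumption-laden toy, not the authors' implementation: the function names, the SiLU activation, the matrix shapes, and the weight values are all hypothetical, and no language model is involved.

```python
import math

def silu(x):
    # SiLU activation; chosen here only for illustration (an assumption).
    return x / (1.0 + math.exp(-x))

def matvec(W, v):
    # Plain matrix-vector product over nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mlp_forward(x, W_up, W_down, neuron_weights=None):
    """Toy two-layer MLP. `neuron_weights` (hypothetical name) scales each
    hidden neuron's activation; all-ones reproduces the unmodified MLP."""
    hidden = [silu(h) for h in matvec(W_up, x)]
    if neuron_weights is None:
        neuron_weights = [1.0] * len(hidden)
    reweighted = [m * h for m, h in zip(neuron_weights, hidden)]
    return matvec(W_down, reweighted)

# Hypothetical toy parameters: 2 inputs, 3 hidden neurons, 2 outputs.
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
x = [1.0, 2.0]

baseline = mlp_forward(x, W_up, W_down)
identity = mlp_forward(x, W_up, W_down, [1.0, 1.0, 1.0])
ablated = mlp_forward(x, W_up, W_down, [1.0, 1.0, 0.0])  # silence neuron 3
```

With all factors equal to one the output is unchanged; setting the third factor to zero removes that neuron's contribution, which here shifts only the first output because the second output does not read from that neuron.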
Taxonomy
Topics: Natural Language Processing Techniques · Digital and Cyber Forensics · Artificial Intelligence in Law
