Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting

Yifan Luo; Zhennan Zhou; Meitan Wang; Bin Dong

arXiv:2410.10150·cs.CL·October 15, 2024

TL;DR

This paper reveals that re-weighting MLP neurons in instruction-tuned LLMs can undermine safety, leading to new white-box jailbreak methods that effectively bypass safety measures across various models.

Contribution

We introduce two novel white-box jailbreak techniques targeting MLP layers, enhancing understanding of LLM safety vulnerabilities and internal mechanisms.

Findings

01 Re-weighting MLP neurons compromises model safety.

02 The proposed jailbreak methods outperform existing techniques.

03 The vulnerabilities are consistent across models ranging in size from 2B to 72B.

Abstract

In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and that MLP layers play a critical role in this process. Based on this hypothesis, we develop two novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across 7 popular open-source LLMs, ranging in size from 2B to 72B. Furthermore, our study provides insights into vulnerabilities of…
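
To make the phrase "re-weighting MLP neurons" concrete, here is a minimal sketch of the mechanism on a toy MLP. It is not the authors' implementation. It assumes that "end-of-sentence inference" refers to the forward pass at the final token position, uses ReLU in place of whatever activation a real model uses, and picks the scaling factors arbitrarily rather than optimizing them. The names `mlp_forward`, `neuron_scale`, and `positions` are hypothetical.

```python
import numpy as np

def mlp_forward(x, w_in, w_out, neuron_scale=None, positions=None):
    """Toy two-layer MLP applied independently to each token.

    x: (seq_len, d_model); w_in: (d_model, d_hidden); w_out: (d_hidden, d_model).
    neuron_scale: optional (d_hidden,) multiplicative factors (hypothetical name).
    positions: token indices where the factors apply; None means every position.
    """
    hidden = np.maximum(x @ w_in, 0.0)  # ReLU stands in for the real activation
    if neuron_scale is not None:
        mask = np.zeros((x.shape[0], 1))
        if positions is None:
            mask[:] = 1.0
        else:
            mask[positions] = 1.0
        # Factor is neuron_scale where mask is 1 and exactly 1 elsewhere.
        hidden = hidden * (mask * neuron_scale + (1.0 - mask))
    return hidden @ w_out

# Random toy data; none of these sizes or values come from the paper.
rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 5, 4, 8
x = rng.normal(size=(seq_len, d_model))
w_in = rng.normal(size=(d_model, d_hidden))
w_out = rng.normal(size=(d_hidden, d_model))

scale = np.ones(d_hidden)
scale[:3] = 0.0  # arbitrarily suppress three neurons (assumption, not a learned value)

baseline = mlp_forward(x, w_in, w_out)
reweighted = mlp_forward(x, w_in, w_out, neuron_scale=scale, positions=[seq_len - 1])
```

In this sketch only the last token's MLP output can change; the outputs at all earlier positions match the baseline. How the factors are chosen (per prompt, or trained offline) is what distinguishes the two methods in the abstract, and that step is not shown here.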

Taxonomy

Topics: Natural Language Processing Techniques · Digital and Cyber Forensics · Artificial Intelligence in Law