Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao, Zhihao Dou, Kaizhu Huang

TL;DR
This paper introduces EEG-Defender, a novel early exit detection method for large language models that effectively reduces jailbreak success rates by identifying malicious prompts early in the generation process.
Contribution
The paper proposes leveraging early transformer outputs to detect malicious prompts, significantly improving jailbreak mitigation with minimal utility loss.
Findings
EEG-Defender reduces attack success rate by approximately 85%.
It outperforms current state-of-the-art defenses, which reduce attack success rate by around 50%.
The approach maintains high utility of LLMs while enhancing security.
Abstract
Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation. In an effort to mitigate such risks, the concept of "Alignment" technology has been developed. However, recent studies indicate that this alignment can be undermined using sophisticated prompt engineering or adversarial suffixes, a technique known as "Jailbreak." Our research takes cues from the human-like generation process of LLMs. We identify that while jailbreaking prompts may yield output logits similar to benign prompts, their initial embeddings within the model's latent space tend to be more analogous to those of malicious prompts. Leveraging this finding, we propose utilizing the early…
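The observation in the abstract, that a jailbreak prompt can look benign at the output while its early-layer embeddings sit closer to malicious prompts, can be illustrated with a minimal sketch. This is not the authors' implementation. The per-layer nearest-centroid classifier, the vote threshold, and every name used here (`fit_prototypes`, `should_refuse`, `early_layers`, `min_votes`) are assumptions made for illustration, and the toy 2-D vectors stand in for real per-layer hidden states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_prototypes(benign, harmful):
    """Build one (benign, harmful) centroid pair per layer.

    Each prompt is a list of per-layer vectors (hypothetical hidden states).
    """
    num_layers = len(benign[0])
    return [
        (centroid([p[i] for p in benign]), centroid([p[i] for p in harmful]))
        for i in range(num_layers)
    ]


def should_refuse(layer_states, prototypes, early_layers, min_votes):
    """Exit early with a refusal once enough early layers vote 'harmful'."""
    votes = 0
    for i in range(early_layers):
        benign_c, harmful_c = prototypes[i]
        state = layer_states[i]
        if cosine(state, harmful_c) > cosine(state, benign_c):
            votes += 1
            if votes >= min_votes:
                return True  # stop here; later layers are never consulted
    return False


# Toy data (assumed): 3 layers, benign prompts near [1, 0], harmful near [0, 1].
benign = [
    [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],
    [[0.8, 0.0], [1.0, 0.1], [0.9, 0.1]],
]
harmful = [
    [[0.1, 1.0], [0.2, 0.9], [0.0, 1.0]],
    [[0.0, 0.8], [0.1, 1.0], [0.1, 0.9]],
]
prototypes = fit_prototypes(benign, harmful)

# A jailbreak-like prompt: harmful-looking early layers, benign-looking last layer.
jailbreak = [[0.2, 0.9], [0.3, 0.8], [0.9, 0.2]]
benign_query = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.1]]

refuse_jailbreak = should_refuse(jailbreak, prototypes, early_layers=2, min_votes=2)
refuse_benign = should_refuse(benign_query, prototypes, early_layers=2, min_votes=2)
```

In this toy setup the jailbreak-like prompt is flagged from its first two layers even though its final-layer vector is closer to the benign centroid, while the benign query passes through. In a real model the vectors would come from intermediate transformer layers, and both thresholds would need tuning against utility loss.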
Taxonomy
Topics: Deception detection and forensic psychology · European Criminal Justice and Data Protection
Methods: Softmax · Attention Is All You Need
