Defending against Jailbreak through Early Exit Generation of Large Language Models

Chongwen Zhao; Zhihao Dou; Kaizhu Huang

arXiv:2408.11308·cs.AI·August 26, 2025

TL;DR

This paper introduces EEG-Defender, an early-exit detection method for large language models that reduces jailbreak success rates by identifying malicious prompts from the outputs of early transformer layers.

Contribution

The paper proposes leveraging early transformer outputs to detect malicious prompts, significantly improving jailbreak mitigation with minimal utility loss.

Findings

01. EEG-Defender reduces the attack success rate (ASR) by approximately 85%.

02. It outperforms state-of-the-art defenses, which reduce ASR by only around 50%.

03. The approach maintains high utility of LLMs while enhancing security.

Abstract

Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation. In an effort to mitigate such risks, the concept of "Alignment" technology has been developed. However, recent studies indicate that this alignment can be undermined using sophisticated prompt engineering or adversarial suffixes, a technique known as "Jailbreak." Our research takes cues from the human-like generation process of LLMs. We identify that while jailbreaking prompts may yield output logits similar to benign prompts, their initial embeddings within the model's latent space tend to be more analogous to those of malicious prompts. Leveraging this finding, we propose utilizing the early…
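
The observation in the abstract (a jailbreak prompt looks benign at the output but sits near malicious prompts in early-layer embedding space) can be illustrated with a toy sketch. This is not the authors' implementation: the nearest-centroid classifier, the voting rule, the names `fit_layer_prototypes` and `should_refuse`, the parameters `early_layers` and `min_votes`, and the 2-D synthetic "embeddings" are all assumptions made for illustration. A real system would read hidden states from the model's transformer layers instead.

```python
import math
import random

def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def nearest_label(x, prototypes):
    """Label of the closest class centroid (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(x, prototypes[label]))

def fit_layer_prototypes(benign_by_layer, harmful_by_layer):
    """One {'benign', 'harmful'} centroid pair per layer (hypothetical helper)."""
    return [
        {"benign": centroid(b), "harmful": centroid(h)}
        for b, h in zip(benign_by_layer, harmful_by_layer)
    ]

def should_refuse(prompt_by_layer, layer_prototypes, early_layers, min_votes):
    """Refuse if at least `min_votes` of the first `early_layers` layers
    place the prompt closer to the harmful centroid (assumed voting rule)."""
    votes = sum(
        nearest_label(prompt_by_layer[i], layer_prototypes[i]) == "harmful"
        for i in range(early_layers)
    )
    return votes >= min_votes

def sample(center, n, rng, noise=0.3):
    """Synthetic stand-in for per-layer embeddings: Gaussian blob around `center`."""
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(n)]

rng = random.Random(0)
NUM_LAYERS = 8  # toy depth, not a real model's layer count
benign_by_layer = [sample([0.0, 0.0], 50, rng) for _ in range(NUM_LAYERS)]
harmful_by_layer = [sample([3.0, 3.0], 50, rng) for _ in range(NUM_LAYERS)]
protos = fit_layer_prototypes(benign_by_layer, harmful_by_layer)

# Toy jailbreak prompt: near the harmful cluster in the first four layers,
# near the benign cluster in the last four (mimicking benign-looking outputs).
jailbreak = [[2.8, 3.1]] * 4 + [[0.2, 0.1]] * 4
benign = [[0.1, -0.2]] * NUM_LAYERS

print(should_refuse(jailbreak, protos, early_layers=4, min_votes=3))  # True
print(should_refuse(benign, protos, early_layers=4, min_votes=3))     # False
print(nearest_label(jailbreak[-1], protos[-1]))                       # benign
```

In this toy setup, a check on the last layer alone would pass the jailbreak prompt, while the vote over early layers flags it. The vote threshold trades off missed jailbreaks against false refusals of benign prompts.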

Taxonomy

Topics: Deception detection and forensic psychology · European Criminal Justice and Data Protection

Methods: Softmax · Attention Is All You Need