HSF: Defending against Jailbreak Attacks with Hidden State Filtering

Cheng Qian; Hainan Zhang; Lei Sha; Zhiming Zheng

arXiv:2409.03788 · cs.CR · June 10, 2025

Open Access

TL;DR

This paper introduces a Hidden State Filter (HSF) mechanism that preemptively detects and rejects jailbreak prompts in LLMs by analyzing hidden state patterns, significantly improving defense effectiveness with minimal impact on benign queries.

Contribution

The paper presents a novel pre-inference defense method using hidden state analysis, outperforming existing techniques in resisting diverse jailbreak attacks with low overhead.
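The idea of filtering on hidden states before generation can be illustrated with a small sketch. It is not the paper's architecture, which this page does not describe. It assumes synthetic vectors standing in for an LLM's hidden states, a plain logistic-regression probe as the filter, and hypothetical names (`train_filter`, `should_reject`, `DIM`) chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_filter(states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe; label 1 means 'reject this prompt'."""
    dim = len(states[0])
    n = len(states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Full-batch gradient descent step on the mean log-loss gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def should_reject(state, w, b, threshold=0.5):
    """Return True if the probe scores the hidden state as unsafe."""
    score = sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)
    return score >= threshold


# Assumption: two well-separated Gaussian clusters stand in for the
# hidden states of benign prompts and of harmful or jailbreak prompts.
DIM = 8
rng = random.Random(0)
benign = [[rng.gauss(-1.0, 0.5) for _ in range(DIM)] for _ in range(100)]
unsafe = [[rng.gauss(1.0, 0.5) for _ in range(DIM)] for _ in range(100)]
states = benign + unsafe
labels = [0.0] * len(benign) + [1.0] * len(unsafe)

w, b = train_filter(states, labels)
```

In a real deployment the vectors would come from the model's own forward pass over the prompt, and the check would run before any tokens are generated, so a rejected prompt never reaches decoding.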

Findings

01. HSF significantly reduces jailbreak success rates

02. Minimal impact on benign query responses

03. Outperforms existing defense baselines

Abstract

With the growing deployment of LLMs in daily applications like chatbots and content generation, efforts to ensure outputs align with human values and avoid harmful content have intensified. However, increasingly sophisticated jailbreak attacks threaten this alignment by aiming to induce unsafe outputs. Current defenses focus either on prompt rewriting or detection, which are limited in effectiveness because jailbreak prompts vary widely in design, or on output control and detection, which are computationally expensive because they require LLM inference. Designing a pre-inference defense method that resists diverse jailbreak prompts is therefore crucial for preventing LLM jailbreak attacks. We observe that jailbreak attacks, safe queries, and harmful queries exhibit different clustering patterns within the LLM's hidden state representation space. This suggests that by leveraging the…

Taxonomy

Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Cybercrime and Law Enforcement Studies

Methods: Focus · ALIGN