HSF: Defending against Jailbreak Attacks with Hidden State Filtering
Cheng Qian, Hainan Zhang, Lei Sha, Zhiming Zheng

TL;DR
This paper introduces a Hidden State Filter (HSF) mechanism that preemptively detects and rejects jailbreak prompts in LLMs by analyzing hidden state patterns, significantly improving defense effectiveness with minimal impact on benign queries.
Contribution
The paper presents a novel pre-inference defense method using hidden state analysis, outperforming existing techniques in resisting diverse jailbreak attacks with low overhead.
Findings
HSF significantly reduces jailbreak success rates
Minimal impact on benign query responses
Outperforms existing defense baselines
Abstract
With the growing deployment of LLMs in daily applications like chatbots and content generation, efforts to ensure outputs align with human values and avoid harmful content have intensified. However, increasingly sophisticated jailbreak attacks threaten this alignment, aiming to induce unsafe outputs. Current defense efforts focus either on prompt rewriting or detection, which are limited in effectiveness because jailbreak prompts vary so widely in design, or on output control and detection, which are computationally expensive because they require LLM inference. Therefore, designing a pre-inference defense method that resists diverse jailbreak prompts is crucial for preventing LLM jailbreak attacks. We observe that jailbreak attacks, safe queries, and harmful queries exhibit different clustering patterns within the LLM's hidden state representation space. This suggests that by leveraging the…
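The clustering observation in the abstract can be illustrated with a toy example. The sketch below is not the paper's HSF classifier, whose details are cut off above. It assumes a simple nearest-centroid rule and uses made-up 3-dimensional vectors as stand-ins for real hidden states. All names (`fit_centroids`, `classify`, `should_reject`, `TOY`) are hypothetical.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit_centroids(labeled):
    """Map each class label to the centroid of its example vectors."""
    return {label: centroid(vecs) for label, vecs in labeled.items()}

def classify(centroids, h):
    """Return the label whose centroid lies closest to vector h."""
    return min(centroids, key=lambda label: dist(centroids[label], h))

def should_reject(centroids, h, blocked=("harmful", "jailbreak")):
    """Pre-inference decision: reject if h falls in a blocked cluster."""
    return classify(centroids, h) in blocked

# Made-up toy vectors standing in for hidden states (assumption, not real data).
TOY = {
    "safe":      [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1], [0.8, 0.0, 0.1]],
    "harmful":   [[0.1, 0.9, 0.0], [0.0, 1.0, 0.2], [0.2, 0.8, 0.1]],
    "jailbreak": [[0.1, 0.1, 0.9], [0.0, 0.2, 1.0], [0.2, 0.0, 0.8]],
}
centroids = fit_centroids(TOY)
```

In a real setting the vectors would be hidden states taken from the model for each prompt, and the decision would be made before any tokens are generated. A learned classifier would likely replace the nearest-centroid rule assumed here.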
Taxonomy
Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Cybercrime and Law Enforcement Studies
Methods: Focus · ALIGN
