Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs
Jinhwa Kim, Ian G. Harris

TL;DR
This paper introduces a Context Filtering model that defends LLMs against jailbreak attacks by filtering malicious context out of the input, reducing attack success rates by up to 88% while preserving model helpfulness. It applies to both white-box and black-box LLMs.
Contribution
The paper presents a novel, plug-and-play Context Filtering approach that improves LLM safety without fine-tuning, outperforming existing defenses against multiple jailbreak attacks.
Findings
Reduces jailbreak attack success rates by up to 88%.
Maintains original LLM helpfulness and performance.
Applicable to both white-box and black-box LLMs.
Abstract
While Large Language Models (LLMs) have shown significant advancements in performance, various jailbreak attacks have posed growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs, prompting them to generate responses to harmful queries. In this study, we propose a new defense mechanism called the Context Filtering model, an input pre-processing method designed to filter out untrustworthy and unreliable context while identifying the primary prompts containing the real user intent to uncover concealed malicious intent. Given that enhancing the safety of LLMs often compromises their helpfulness, potentially affecting the experience of benign users, our method aims to improve the safety of the LLMs while preserving their original performance. We evaluate the effectiveness of our model in defending against jailbreak attacks through comparative…
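The abstract describes the defense as an input pre-processing step that sits in front of an unmodified LLM. A minimal sketch of that plug-and-play shape is given below. All names here (`with_context_filter`, `toy_filter`, `toy_llm`) are hypothetical, and the toy filter is only a placeholder heuristic: the paper's Context Filtering model is a learned component whose details are not given in this summary.

```python
from typing import Callable

# Both the filter and the LLM are modeled as plain text-to-text functions.
# This is an assumption made for illustration, not the paper's interface.
TextFn = Callable[[str], str]


def with_context_filter(llm: TextFn, context_filter: TextFn) -> TextFn:
    """Wrap `llm` so every prompt passes through `context_filter` first."""

    def guarded(prompt: str) -> str:
        # Step 1: strip surrounding context, keep the primary prompt.
        primary_prompt = context_filter(prompt)
        # Step 2: the unmodified LLM only sees the filtered prompt.
        return llm(primary_prompt)

    return guarded


def toy_filter(prompt: str) -> str:
    """Placeholder filter: keeps only the last non-empty line.

    A stand-in for the learned Context Filtering model, NOT the
    method from the paper.
    """
    lines = [line.strip() for line in prompt.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def toy_llm(prompt: str) -> str:
    """Placeholder LLM that echoes what it received."""
    return f"[model saw] {prompt}"


guarded_llm = with_context_filter(toy_llm, toy_filter)
```

Because the wrapper only touches the input string, the same pattern would apply whether `llm` is a local white-box model or a call to a black-box API, which is consistent with the applicability claim in the findings.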
