No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
Zhiyu Xue, Guangliang Liu, Bocheng Chen, Kristen Marie Johnson, Ramtin Pedarsani

TL;DR
This paper investigates the effectiveness of in-context learning (ICL) as a defense against prefilling jailbreak attacks on large language models, revealing both its strengths and limitations through extensive analysis.
Contribution
It demonstrates that adversative sentence structures in ICL can effectively defend against prefilling attacks, but also highlights the inherent limitations and over-defensiveness of this approach.
Findings
Adversative ICL structures provide robust defense across models.
Current safety alignment methods do not mitigate prefilling jailbreaks.
LLMs show over-defensiveness with adversative ICL demonstrations.
Abstract
The security of Large Language Models (LLMs) has become an important research topic since the emergence of ChatGPT. Though there have been various effective methods to defend against jailbreak attacks, prefilling attacks remain an unsolved and popular threat against open-sourced LLMs. In-Context Learning (ICL) offers a computationally efficient defense against various jailbreak attacks, yet no effective ICL methods have been developed to counter prefilling attacks. In this paper, we: (1) show that ICL can effectively defend against prefilling jailbreak attacks by employing adversative sentence structures within demonstrations; (2) characterize the effectiveness of this defense through the lens of model size, number of demonstrations, over-defense, integration with other jailbreak attacks, and the presence of safety alignment. Given the experimental results and our analysis, we conclude…
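The abstract does not spell out the demonstration format, so the following is only a minimal Python sketch of what "adversative sentence structures within demonstrations" could look like. It assumes a plain `User:`/`Assistant:` text template, the connective "However,", and placeholder queries; the names `Demonstration`, `format_demo`, and `build_prompt` are hypothetical and not taken from the paper. Each demonstration opens with an affirmative prefix, as a prefilled response would, then pivots to a refusal, and the attacker's prefill is appended to the final assistant turn.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """One in-context example (all field contents are hypothetical)."""
    query: str    # a request the model should decline
    prefix: str   # affirmative opening, mimicking a prefilled response
    refusal: str  # refusal that follows the adversative connective


def format_demo(demo: Demonstration, connective: str = "However,") -> str:
    # Adversative structure: affirmative prefix, connective, then refusal.
    answer = f"{demo.prefix} {connective} {demo.refusal}"
    return f"User: {demo.query}\nAssistant: {answer}"


def build_prompt(demos: List[Demonstration], query: str, prefill: str,
                 connective: str = "However,") -> str:
    # Demonstrations come first; the final assistant turn is left open and
    # starts with the prefilled text that the model must continue.
    blocks = [format_demo(d, connective) for d in demos]
    blocks.append(f"User: {query}\nAssistant: {prefill}")
    return "\n\n".join(blocks)


demos = [
    Demonstration(
        query="<harmful request A>",
        prefix="Sure, here is how to do that.",
        refusal="I cannot help with this request.",
    ),
    Demonstration(
        query="<harmful request B>",
        prefix="Of course, the steps are as follows.",
        refusal="I must decline, because this could cause harm.",
    ),
]

prompt = build_prompt(demos, "<new harmful request>", "Sure, here is")
print(prompt)
```

In this sketch the defense lives entirely in the prompt: no model weights change, which is consistent with the abstract's description of ICL as computationally efficient. Whether a given model actually continues the prefill with a refusal is an empirical question that the sketch does not answer.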
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Terrorism, Counterterrorism, and Political Violence · Network Security and Intrusion Detection
