Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang,, Hongyan Bao, Xiangliang Zhang

TL;DR
This paper introduces the In-Context Adversarial Game (ICAG), a novel method for defending large language models against jailbreak attacks without fine-tuning, using an iterative adversarial learning process.
Contribution
The paper proposes ICAG, an innovative adversarial training framework that dynamically enhances LLM defenses against jailbreaks through agent learning and iterative improvement.
Findings
ICAG significantly reduces jailbreak success rates across various attack scenarios.
ICAG demonstrates high transferability to different LLMs.
The method does not require fine-tuning, making it versatile and efficient.
Abstract
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Digital and Cyber Forensics
