Defending Jailbreak Prompts via In-Context Adversarial Game

Yujun Zhou; Yufei Han; Haomin Zhuang; Kehan Guo; Zhenwen Liang; Hongyan Bao; Xiangliang Zhang

arXiv:2402.13148 · cs.LG · February 25, 2025 · 3 cites

Open Access

TL;DR

This paper introduces the In-Context Adversarial Game (ICAG), a method that defends large language models against jailbreak attacks without fine-tuning by running an iterative adversarial game between an attack agent and a defense agent.

Contribution

The paper proposes ICAG, an adversarial game inspired by adversarial training, in which attack and defense agents improve iteratively through agent learning, so that LLM defenses against jailbreaks are extended dynamically rather than derived from a static dataset.

Findings

01. ICAG significantly reduces jailbreak success rates across various attack scenarios.

02. ICAG demonstrates high transferability to different LLMs.

03. The method does not require fine-tuning, making it versatile and efficient.

Abstract

Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced…
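The iterative process described in the abstract can be pictured with a toy loop. The sketch below is not the authors' implementation: ICAG uses LLM agents, whereas every function here (`target_model`, `defense_update`, `attack_update`, `run_game`) is a hypothetical rule-based stand-in, and the bracketed "trick" tags are an invented simplification of jailbreak strategies. It only illustrates the alternation between a defense that accumulates in-context insights and an attack that rewrites blocked prompts, with no model weights being updated.

```python
import re

# Hypothetical convention: each attack prompt starts with a tag naming its trick.
TAG = re.compile(r"^\[(\w+)\]")

def target_model(prompt, insights):
    """Toy stand-in for a defended LLM: refuse if the prompt's trick is a known insight."""
    m = TAG.match(prompt)
    return "REFUSE" if m and m.group(1) in insights else "COMPLY"

def defense_update(insights, successful_prompts):
    """Toy defense agent: record the trick behind each successful jailbreak."""
    learned = set(insights)
    for p in successful_prompts:
        m = TAG.match(p)
        if m:
            learned.add(m.group(1))
    return learned

def attack_update(prompt, insights, tricks):
    """Toy attack agent: rewrite a blocked prompt with a trick the defense has not seen."""
    if target_model(prompt, insights) == "COMPLY":
        return prompt  # still works, keep it
    body = TAG.sub("", prompt).strip()
    for t in tricks:
        if t not in insights:
            return f"[{t}] {body}"
    return prompt  # no unused trick left

def run_game(prompts, tricks, rounds):
    """Alternate defense and attack updates; return per-round jailbreak success rates."""
    insights = set()
    rates = []
    for _ in range(rounds):
        wins = [p for p in prompts if target_model(p, insights) == "COMPLY"]
        rates.append(len(wins) / len(prompts))
        insights = defense_update(insights, wins)                        # defense learns
        prompts = [attack_update(p, insights, tricks) for p in prompts]  # attack adapts
    return rates, insights

rates, insights = run_game(
    ["[roleplay] request A", "[hypothetical] request B"],
    tricks=["roleplay", "hypothetical", "translation"],
    rounds=3,
)
print(rates)  # [1.0, 1.0, 0.0]
```

In this toy run the attack keeps succeeding while it still has an unseen trick and fails once the defense's insight set covers all of them. The real method's behavior depends on the LLM agents involved and is not captured by these stubs.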

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Digital and Cyber Forensics