Purple-teaming LLMs with Adversarial Defender Training

Jingyan Zhou; Kun Li; Junan Li; Jiawen Kang; Minda Hu; Xixin Wu; Helen; Meng

arXiv:2407.01850·cs.CL·July 3, 2024

Purple-teaming LLMs with Adversarial Defender Training

Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen, Meng

PDF

Open Access

TL;DR

This paper introduces PAD, a novel adversarial training pipeline combining red and blue teaming to improve LLM safety by actively identifying vulnerabilities and enhancing safe response generation.

Contribution

The paper presents a self-play adversarial training framework for LLM safety, integrating attack and defense modules in a generative adversarial manner to improve vulnerability detection and safety.

Findings

01

PAD outperforms baselines in attack effectiveness and safety robustness

02

It balances safety with overall model quality effectively

03

Identifies challenges like multi-turn attack defense and risk detection strategies

Abstract

Existing efforts in safeguarding LLMs are limited in actively exposing the vulnerabilities of the target LLM and readily adapting to newly emerging safety risks. To address this, we present Purple-teaming LLMs with Adversarial Defender training (PAD), a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques. In PAD, we automatically collect conversational data that cover the vulnerabilities of an LLM around specific safety risks in a self-play manner, where the attacker aims to elicit unsafe responses and the defender generates safe responses to these attacks. We then update both modules in a generative adversarial network style by training the attacker to elicit more unsafe responses and updating the defender to identify them and explain the unsafe reason. Experimental results demonstrate that PAD…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsImbalanced Data Classification Techniques · Digital Rights Management and Security · Advanced Malware Detection Techniques