Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain

TL;DR
This paper introduces a novel GFlowNet-based approach for automated red-teaming of large language models, generating diverse and effective attack prompts to improve safety and robustness against harmful responses.
Contribution
It proposes a GFlowNet fine-tuning method for red-teaming that overcomes mode collapse and enhances attack diversity and effectiveness compared to existing reinforcement learning approaches.
Findings
GFlowNet-based attacks are effective against various LLMs.
Generated prompts transfer well across different models.
Safety-tuning with these prompts improves model robustness.
Abstract
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We…
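The contrast the abstract draws between mode collapse and diverse attacks can be illustrated with a toy sketch. It relies on one general fact about GFlowNets: they are trained so that objects are sampled with probability proportional to a reward, rather than to maximize that reward. Everything else here is an assumption made for illustration: the four attack "modes", their reward values, and the function names are hypothetical, and the sketch does not model the paper's actual reward, attacker model, or smoothing phase.

```python
import math

# Hypothetical toy rewards (e.g. toxicity-classifier scores) for four
# made-up attack "modes". Illustrative numbers only, not from the paper.
rewards = {"mode_a": 0.90, "mode_b": 0.85, "mode_c": 0.80, "mode_d": 0.05}


def greedy_policy(rewards):
    """Idealized limit of pure reward maximization: all probability
    mass sits on the single highest-reward mode (mode collapse)."""
    best = max(rewards, key=rewards.get)
    return {m: (1.0 if m == best else 0.0) for m in rewards}


def proportional_policy(rewards):
    """Reward-proportional sampling, the target a GFlowNet aims for:
    each mode is drawn with probability reward / sum of rewards."""
    total = sum(rewards.values())
    return {m: r / total for m, r in rewards.items()}


def entropy(policy):
    """Shannon entropy in nats; higher means more diverse samples."""
    return sum(p * math.log(1.0 / p) for p in policy.values() if p > 0)


greedy = greedy_policy(rewards)
proportional = proportional_policy(rewards)

print("greedy policy:      ", greedy)
print("proportional policy:", proportional)
print("greedy entropy:      ", entropy(greedy))
print("proportional entropy:", entropy(proportional))
```

In this toy setting the greedy policy only ever produces one mode, while the reward-proportional policy keeps substantial mass on all three high-reward modes and still rarely samples the low-reward one.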
Taxonomy
Topics: Network Security and Intrusion Detection · Topic Modeling · Web Application Security Vulnerabilities
