Learning diverse attacks on large language models for robust red-teaming and safety tuning

Seanie Lee; Minsu Kim; Lynn Cherif; David Dobre; Juho Lee; Sung Ju Hwang; Kenji Kawaguchi; Gauthier Gidel; Yoshua Bengio; Nikolay Malkin; Moksh Jain

arXiv:2405.18540·cs.CL·March 3, 2025

TL;DR

This paper introduces a novel GFlowNet-based approach for automated red-teaming of large language models, generating diverse and effective attack prompts to improve safety and robustness against harmful responses.

Contribution

It proposes a GFlowNet fine-tuning method for red-teaming that overcomes mode collapse and enhances attack diversity and effectiveness compared to existing reinforcement learning approaches.

Findings

1. GFlowNet-based attacks are effective against various LLMs.
2. Generated prompts transfer well across different models.
3. Safety-tuning with these prompts improves model robustness.

Abstract

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We…
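The abstract excerpt above does not state the training objective, so the following is only a minimal sketch of one objective commonly used for GFlowNet fine-tuning of autoregressive models, the trajectory balance loss. It assumes a prompt is generated token by token, so the backward policy is trivial and the loss for one prompt reduces to the squared residual (log Z + sum of token log-probabilities - log reward). The function names, and the way the reward combines a toxicity score with a reference-model likelihood through the temperatures `beta` and `gamma`, are hypothetical choices for illustration and are not taken from the text above.

```python
import math


def trajectory_balance_loss(log_z, token_log_probs, log_reward):
    """Squared trajectory-balance residual for one sampled attack prompt.

    log_z           -- learned scalar estimate of the log partition function
    token_log_probs -- attacker log-probability of each generated token
    log_reward      -- log of the (unnormalized) reward for the full prompt
    """
    # For autoregressive generation each prompt has a single construction
    # path, so the forward log-probability is the sum over its tokens.
    log_p_forward = sum(token_log_probs)
    residual = log_z + log_p_forward - log_reward
    return residual ** 2


def tempered_log_reward(log_toxicity, log_ref_prob, beta=0.1, gamma=1.0):
    """Hypothetical reward: a toxicity term and a fluency term, each tempered.

    log_toxicity -- log of a classifier's score for the target LLM's response
    log_ref_prob -- log-likelihood of the prompt under a reference model
    beta, gamma  -- assumed temperatures; smaller values sharpen that term
    """
    return log_toxicity / beta + log_ref_prob / gamma


# The loss is zero when the attacker samples a prompt with probability
# exactly proportional to its reward (here Z = 1 and R(x) = 0.25).
example_loss = trajectory_balance_loss(
    log_z=0.0,
    token_log_probs=[math.log(0.5), math.log(0.5)],
    log_reward=math.log(0.25),
)
```

Driving this residual to zero over many sampled prompts pushes the attacker toward sampling prompts in proportion to their reward rather than concentrating on a single highest-reward prompt, which is the property that motivates using a GFlowNet objective when diversity matters.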

Code & Models

Repositories

GFNOrg/red-teaming
PyTorch · Official

Taxonomy

Topics: Network Security and Intrusion Detection · Topic Modeling · Web Application Security Vulnerabilities