Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
Jonathan Nöther, Adish Singla, Goran Radanović

TL;DR
This paper introduces DART, a black-box red-teaming method for large language models inspired by text-diffusion models. DART perturbs reference prompts to uncover harmful behaviors in a target LLM while keeping the discovered prompts close to those references.
Contribution
The paper presents a novel diffusion-based red-teaming approach that outperforms existing methods in finding harmful prompts close to reference inputs.
Findings
DART significantly improves harmful input discovery near reference prompts.
Established auto-regressive models underperform in proximity-constrained red-teaming.
DART outperforms fine-tuning and prompting-based methods in effectiveness.
Abstract
Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART…
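The proximity-constrained objective described in the abstract can be written compactly. The following is only a sketch consistent with that description, not the paper's own formulation. All notation is assumed here for illustration: \pi_\theta is the red-teaming method that maps a reference prompt p to a modified prompt p', \mathcal{D} is the reference dataset, R scores how harmful the target LLM's response to p' is, d is some distance between prompts, and \epsilon is the proximity budget.

```latex
% Hypothetical notation; the paper may define these quantities differently.
\[
\max_{\theta}\;
\mathbb{E}_{p \sim \mathcal{D},\; p' \sim \pi_\theta(\cdot \mid p)}
\bigl[ R(p') \bigr]
\quad \text{subject to} \quad
d(p, p') \le \epsilon .
\]
```

Under this reading, a smaller \epsilon ties each discovered prompt more tightly to the topic, writing style, or type of harmful behavior of its reference prompt, which is what makes the assessment targeted.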
Taxonomy
Topics: Topic Modeling
Methods: Diffusion · Diffusion for Auditing and Red-Teaming (DART)
