Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

Jonathan Nöther; Adish Singla; Goran Radanović

arXiv:2501.08246 · cs.LG · January 15, 2025

TL;DR

This paper introduces DART, a black-box text-diffusion method for red-teaming large language models, which effectively uncovers harmful behaviors by perturbing prompts while maintaining proximity to reference prompts.

Contribution

The paper presents a novel diffusion-based red-teaming approach that outperforms existing methods in finding harmful prompts close to reference inputs.

Findings

01. DART significantly improves harmful input discovery near reference prompts.

02. Established auto-regressive models underperform in proximity-constrained red-teaming.

03. DART outperforms fine-tuning and prompting-based methods in effectiveness.

Abstract

Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART…
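To make the proximity constraint concrete, here is a minimal sketch of what "search for a harmful input that stays close to a reference" can look like. It is not the authors' method: it swaps in plain random search for DART's diffusion-inspired procedure, it assumes prompts are represented as numeric vectors (the visible abstract does not specify the representation), and `harm_score`, `epsilon`, and every function name are hypothetical placeholders. In a real audit, `harm_score` would stand for a black-box query to the target LLM followed by a safety classifier.

```python
import math
import random


def project_to_ball(delta, epsilon):
    """Rescale delta so that its L2 norm is at most epsilon."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= epsilon or norm == 0.0:
        return list(delta)
    scale = epsilon / norm
    return [d * scale for d in delta]


def perturb_within_budget(reference, epsilon, rng):
    """Draw Gaussian noise, project it into the epsilon-ball, and add it to the reference."""
    noise = [rng.gauss(0.0, 1.0) for _ in reference]
    delta = project_to_ball(noise, epsilon)
    return [r + d for r, d in zip(reference, delta)]


def search(reference, harm_score, epsilon, n_candidates=64, seed=0):
    """Random search under a proximity budget (an illustrative stand-in, not DART).

    harm_score is treated as a black box: it is only called, never differentiated.
    """
    rng = random.Random(seed)
    best, best_score = list(reference), harm_score(reference)
    for _ in range(n_candidates):
        candidate = perturb_within_budget(reference, epsilon, rng)
        score = harm_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Every candidate is built by adding a perturbation whose norm is capped at `epsilon`, so the returned vector is never farther than `epsilon` from the reference, whatever the scorer prefers. Shrinking `epsilon` tightens the anchor to the reference; enlarging it trades proximity for a wider search.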

Taxonomy

Topics: Topic Modeling

Methods: Diffusion · Diffusion for Auditing and Red-Teaming (DART)