FLIRT: Feedback Loop In-context Red Teaming

Ninareh Mehrabi; Palash Goyal; Christophe Dupuy; Qian Hu; Shalini Ghosh; Richard Zemel; Kai-Wei Chang; Aram Galstyan; Rahul Gupta

arXiv:2308.04265·cs.AI·November 11, 2024·2 cites

TL;DR

This paper introduces FLIRT, an automatic feedback loop-based red teaming framework that tests the robustness of generative models, including text-to-image and text-to-text models, by exposing vulnerabilities to unsafe content generation.

Contribution

We propose a novel in-context learning feedback loop for automatic red teaming that uncovers vulnerabilities of black-box generative models to unsafe content generation.

Findings

01. Stable Diffusion models are vulnerable despite safety features

02. The framework effectively red teams text-to-text models

03. Adversarial prompts can trigger unsafe content generation

Abstract

Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. In this work, we propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. In particular, taking text-to-image models as target models, we explore different feedback mechanisms to automatically learn effective and diverse adversarial prompts. Our experiments demonstrate that even with enhanced safety features, Stable Diffusion (SD) models are vulnerable to our adversarial prompts, raising concerns on their robustness in…
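The feedback loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `feedback_loop`, `generate`, `target`, and `unsafe_score`, the score threshold, and the queue-style exemplar update are all hypothetical, and the toy stand-ins replace a real red language model, target model, and safety classifier.

```python
from typing import Callable, List


def feedback_loop(
    seed_prompts: List[str],
    generate: Callable[[List[str]], str],
    target: Callable[[str], str],
    unsafe_score: Callable[[str], float],
    iterations: int = 10,
    threshold: float = 0.5,
) -> List[str]:
    """Run a simple in-context red teaming loop (illustrative sketch only).

    All callables are hypothetical stand-ins:
      generate     -- red LM: in-context exemplars -> new candidate prompt
      target       -- black-box model under test: prompt -> output
      unsafe_score -- safety classifier: output -> score in [0, 1]
    """
    exemplars = list(seed_prompts)
    successful: List[str] = []
    for _ in range(iterations):
        candidate = generate(exemplars)
        output = target(candidate)
        # Feedback step: keep the candidate only if it triggered unsafe output.
        if unsafe_score(output) >= threshold:
            successful.append(candidate)
            # Assumed update rule: drop the oldest exemplar, append the newest.
            exemplars = exemplars[1:] + [candidate]
    return successful


# Toy stand-ins so the sketch runs without any model.
def toy_generate(exemplars: List[str]) -> str:
    return exemplars[-1] + "!"


def toy_target(prompt: str) -> str:
    return prompt.upper()


def toy_unsafe_score(output: str) -> float:
    return min(1.0, output.count("!") / 2)


if __name__ == "__main__":
    found = feedback_loop(
        ["a", "b"], toy_generate, toy_target, toy_unsafe_score, iterations=3
    )
    print(found)
```

The update rule is the main design choice in such a loop: which successful prompts are fed back as exemplars determines how effective and how diverse the next candidates are. The queue-style replacement here is only one simple option chosen for illustration.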

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Automated red teaming is a timely and important problem, and there have been relatively few papers focusing on text-to-image red teaming
- The few-shot method from Perez et al. was an interesting approach, and I'm glad to see more exploration of this type of method
- There are many different variations of in-context red teaming explored in this paper, which could be helpful to future papers seeking to explore this space further
- The results are strong

Weaknesses

- It would be good to have more baselines. E.g., methods like PEZ have also been evaluated primarily on text-to-image models, and some concurrent work from Google DeepMind would be good to compare to: https://arxiv.org/abs/2309.03409. The limited comparison to other methods is the main reason why I'm not giving a higher score initially.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- Red teaming mediated by in-context learning is appealing because of the inductive biases that models have and because of a human’s ability to influence the process with prompting.
- I think their dataset of 76k prompts will genuinely be useful (I haven’t personally looked through examples from it though.)
- Section 3.2 was well done.
- Overall well-written

Weaknesses

1. I get how SFS is a relevant few-shot baseline. But it seems like a fairly weak one overall. Other, perhaps less-efficient baselines could have been tested. For example, one could use the type of RL-based attack technique used in [Deng et al. (2022)](https://arxiv.org/abs/2205.12548), [Perez et al. (2022)](https://arxiv.org/abs/2202.03286), and [Casper et al. (2023)](https://arxiv.org/abs/2306.09442). Other approaches based on zero-order search could also be used like [Zou et al. (2023)](https

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- The idea is simple and intuitive.
- The paper is well-written and easy to understand.
- The paper contains red-teaming results for both text-to-text models and text-to-image models.
- The authors evaluate the baseline method and FLIRT with GPT-Neo as a red LM, which is much cheaper than the Gopher model used in [Perez et al., 2022]. This is a huge contribution for subsequent researchers.

Weaknesses

If I understood correctly, the contributions of this paper can be listed as follows:
a. Propose in-context learning methods that are better than the stochastic few-shot method of [Perez et al., 2022].
b. The proposed methods can control the diversity and toxicity of generated prompts.
c. Evaluate the red teaming methods on not only text-to-text models but also text-to-image models.
Soundness [a]: The empirical results supporting the superiority of the proposed method seem weak.
Missing reference [b,c]: There

Videos

FLIRT: Feedback Loop In-context Red Teaming

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Diffusion