ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Guanlin Li; Kangjie Chen; Shudong Zhang; Jie Zhang; Tianwei Zhang

arXiv:2405.19360·cs.CR·October 14, 2024

TL;DR

This paper introduces ART, an automatic red-teaming framework that uses vision-language and large language models to identify safety vulnerabilities in text-to-image models, revealing their toxicity and safety risks.

Contribution

The paper presents a novel automated approach for systematically evaluating safety risks in text-to-image models, including new datasets and validation of ART's effectiveness.

Findings

01 Revealed toxicity in popular open-source text-to-image models

02 Validated ART's effectiveness, adaptability, and diversity

03 Provided large-scale red-teaming datasets for safety risk analysis

Abstract

Large-scale pre-trained generative models are taking the world by storm thanks to their ability to generate creative content. Meanwhile, safeguards for these generative models are being developed to protect users' rights and safety, though most of them are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate a model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To evaluate the safety risks of text-to-image models more systematically, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both a vision-language model and a large language model to establish a connection between unsafe generations and their prompts, thereby identifying the model's vulnerabilities more efficiently. With our comprehensive…
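
The loop the abstract describes, in which a vision-language model and a large language model link unsafe generations back to their prompts, can be pictured with a minimal sketch. Everything named here is an assumption for illustration: `red_team`, `benign_failures`, the `guide` and `writer` roles, and the two safety judges are hypothetical, and the lambdas are toy stand-ins rather than real models. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Round:
    prompt: str
    image: str          # placeholder for a generated image
    prompt_safe: bool   # verdict of a (hypothetical) prompt judge
    image_safe: bool    # verdict of a (hypothetical) image judge


def red_team(
    seed_prompt: str,
    generate_image: Callable[[str], str],   # text-to-image model under test
    guide: Callable[[str, str], str],       # VLM role: inspects prompt and image, returns feedback
    writer: Callable[[str, str], str],      # LLM role: rewrites the prompt from the feedback
    prompt_is_safe: Callable[[str], bool],
    image_is_safe: Callable[[str], bool],
    max_rounds: int = 5,
) -> List[Round]:
    """Iteratively probe the model and record the judges' verdicts per round."""
    history: List[Round] = []
    prompt = seed_prompt
    for _ in range(max_rounds):
        image = generate_image(prompt)
        history.append(
            Round(prompt, image, prompt_is_safe(prompt), image_is_safe(image))
        )
        feedback = guide(prompt, image)
        prompt = writer(prompt, feedback)
    return history


def benign_failures(history: List[Round]) -> List[Round]:
    """Keep rounds where a prompt judged safe still produced an unsafe image."""
    return [r for r in history if r.prompt_safe and not r.image_safe]


# Toy stand-ins (assumptions): strings play the role of images and judges are trivial.
history = red_team(
    seed_prompt="cat",
    generate_image=lambda p: "img:" + p,
    guide=lambda p, img: "add detail",
    writer=lambda p, fb: p + "+",
    prompt_is_safe=lambda p: True,
    image_is_safe=lambda img: len(img) < 8,
    max_rounds=3,
)
flagged = benign_failures(history)
```

The filter in `benign_failures` reflects the case the abstract highlights: prompts that look safe yet lead to unsafe generations, as opposed to deliberately malicious prompts.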

Code & Models

Repositories

guanlinlee/art (PyTorch, official)

Models

AdamCodd/distilroberta-nsfw-prompt-stable-diffusion (Hugging Face model, ♡ 22)

Videos

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users (SlidesLive)

Taxonomy

Topics: Digital Media Forensic Detection · Digital and Cyber Forensics · Handwritten Text Recognition Techniques

Methods: Focus