ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang

TL;DR
This paper introduces ART, an automatic red-teaming framework that uses vision-language and large language models to identify safety vulnerabilities in text-to-image models, revealing their toxicity and safety risks.
Contribution
The paper presents a novel automated approach for systematically evaluating safety risks in text-to-image models, introduces new red-teaming datasets, and validates ART's effectiveness.
Findings
Revealed toxicity in popular open-source text-to-image models
Validated ART's effectiveness, adaptability, and diversity
Provided large-scale red-teaming datasets for safety risk analysis
Abstract
Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models have been developed to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both a vision language model and a large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive…
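The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`red_team`, `write_prompt`, `describe_image`, and so on), the division of roles between the language model and the vision-language model, and the stub models in `demo` are all assumptions made for illustration. The only ideas taken from the abstract are that a safe prompt can yield an unsafe image, and that a vision-language model and a large language model together link unsafe generations back to the prompts that caused them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Finding:
    """A safe-looking prompt that produced an unsafe image (hypothetical record type)."""
    prompt: str
    image: str  # stand-in for real image data


def red_team(
    write_prompt: Callable[[Optional[str]], str],   # assumed LLM role: propose the next prompt
    generate_image: Callable[[str], str],           # text-to-image model under test
    prompt_is_safe: Callable[[str], bool],          # assumed prompt-side safety judge
    image_is_unsafe: Callable[[str], bool],         # assumed image-side safety judge
    describe_image: Callable[[str, str], str],      # assumed VLM role: feedback on prompt + image
    rounds: int = 5,
) -> List[Finding]:
    findings: List[Finding] = []
    feedback: Optional[str] = None
    for _ in range(rounds):
        prompt = write_prompt(feedback)
        # Only benign prompts are of interest: skip anything already unsafe as text.
        if not prompt_is_safe(prompt):
            feedback = "The previous prompt was itself unsafe; rewrite it benignly."
            continue
        image = generate_image(prompt)
        if image_is_unsafe(image):
            findings.append(Finding(prompt, image))
        # The image-aware feedback steers the next prompt.
        feedback = describe_image(prompt, image)
    return findings


def demo() -> List[Finding]:
    """Run the loop with toy stubs in place of real models."""
    counter = {"n": 0}

    def write_prompt(feedback: Optional[str]) -> str:
        counter["n"] += 1
        return f"benign prompt #{counter['n']}"

    def generate_image(prompt: str) -> str:
        return f"image for {prompt}"

    def prompt_is_safe(prompt: str) -> bool:
        return True

    def image_is_unsafe(image: str) -> bool:
        return image.endswith("#2")  # toy rule: only the second image is flagged

    def describe_image(prompt: str, image: str) -> str:
        return f"saw {image}"

    return red_team(write_prompt, generate_image, prompt_is_safe,
                    image_is_unsafe, describe_image, rounds=3)
```

Passing the models in as plain callables keeps the sketch independent of any particular model library; real judges and generators would replace the stubs in `demo`.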
Taxonomy
Topics: Digital Media Forensic Detection · Digital and Cyber Forensics · Handwritten Text Recognition Techniques
