AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts

Yufan Liu; Wanqian Zhang; Huashan Chen; Lin Wang; Xiaojun Jia; Zheng Lin; Weiping Wang

arXiv:2510.24034 · cs.CV · October 29, 2025

TL;DR

This paper introduces APT (AutoPrompT), a black-box framework that uses large language models to generate human-readable adversarial prompts. These prompts bypass the safety filters of text-to-image (T2I) models, which exposes vulnerabilities in their safety mechanisms.

Contribution

The paper presents a novel LLM-driven method for automatically generating human-readable adversarial prompts. The prompts bypass safety filters, and the method requires no white-box access to the target T2I model.

Findings

01. Effective red-teaming performance demonstrated
02. High transferability to unseen prompts
03. Vulnerabilities exposed in commercial T2I APIs

Abstract

Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms remain vulnerable to adversarial prompts, which can be maliciously crafted to produce unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities have three limitations. They usually require white-box access to T2I models. They rely on inefficient per-prompt optimization. They also inevitably generate semantically meaningless prompts that filters can easily block. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline. This pipeline interleaves adversarial suffix optimization with fine-tuning the LLM on the optimized suffixes. Furthermore, we integrate a dual-evasion strategy into the optimization phase, enabling the bypass of both perplexity-based…
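The abstract argues that semantically meaningless suffixes are easily blocked and that APT must evade perplexity-based filtering. It is truncated before it describes those filters, so the following is a generic illustration and not the filter APT is evaluated against. The sketch assumes a scoring language model that supplies per-token natural-log probabilities. The function names and the threshold of 100 are hypothetical. Perplexity is the exponential of the mean negative log-likelihood. Fluent text therefore scores low, and random token strings score high.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities).

    `token_logprobs` is assumed to come from some scoring language model;
    which model a real filter uses is not specified here.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
    return math.exp(mean_nll)


def passes_perplexity_filter(token_logprobs: Sequence[float],
                             threshold: float = 100.0) -> bool:
    """Hypothetical filter rule: accept a prompt only if its perplexity <= threshold."""
    return perplexity(token_logprobs) <= threshold


if __name__ == "__main__":
    fluent = [-2.0] * 6     # readable text: each token fairly likely
    gibberish = [-8.0] * 6  # optimized token soup: each token very unlikely
    print(perplexity(fluent), passes_perplexity_filter(fluent))
    print(perplexity(gibberish), passes_perplexity_filter(gibberish))
```

Consider a prompt whose tokens each have log-probability -2.0. It has perplexity e^2 ≈ 7.39 and passes. A prompt at -8.0 per token scores e^8 ≈ 2981 and is rejected. This gap is why the abstract stresses human-readable suffixes: a suffix must keep perplexity low to get past such a filter.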