AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models
Yiming Wang, Jiahao Chen, Qingming Li, Tong Zhang, Rui Zeng, Xing Yang, Shouling Ji

TL;DR
AEIOU is a versatile, efficient, and interpretable framework that enhances safety in text-to-image models by accurately detecting NSFW prompts using hidden state features, outperforming existing moderation tools.
Contribution
This paper introduces AEIOU, a novel unified defense framework that leverages hidden state features for accurate, real-time NSFW prompt detection in T2I models, with broad adaptability and improved efficiency.
Findings
Achieves over 95% accuracy across datasets.
Improves detection efficiency by at least tenfold.
Effectively counters adaptive and multi-label attacks.
Abstract
As text-to-image (T2I) models advance and gain widespread adoption, their associated safety concerns are becoming increasingly critical. Malicious users exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, underscoring the need for effective safeguards to ensure the integrity and compliance of model outputs. However, existing detection methods are often inaccurate and inefficient. In this paper, we propose AEIOU, a defense framework that is adaptable, efficient, interpretable, optimizable, and unified against NSFW prompts in T2I models. AEIOU extracts NSFW features from the hidden states of the model's text encoder, utilizing the separable nature of these features to detect NSFW prompts. The detection process is efficient, requiring minimal inference time. AEIOU also offers real-time interpretation of results and supports…
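The idea of detecting prompts from separable hidden-state features can be illustrated with a small sketch. This is not the AEIOU method itself, whose feature extraction is not specified in the text above; it swaps in a plain difference-of-means linear probe. The feature vectors are synthetic Gaussian stand-ins for text-encoder hidden states, and the names `fit_direction` and `is_flagged`, the dimension 64, and the offset of 6.0 are all assumptions made for illustration.

```python
import numpy as np


def fit_direction(benign, flagged):
    """Fit a difference-of-means probe: a unit direction plus a midpoint threshold."""
    mu_benign = benign.mean(axis=0)
    mu_flagged = flagged.mean(axis=0)
    direction = mu_flagged - mu_benign
    direction /= np.linalg.norm(direction)  # normalize to unit length
    # Threshold halfway between the two projected class means.
    threshold = 0.5 * (mu_benign @ direction + mu_flagged @ direction)
    return direction, threshold


def is_flagged(features, direction, threshold):
    """Flag vectors whose projection onto the direction exceeds the threshold."""
    return features @ direction > threshold


# Synthetic stand-ins for hidden states (assumption: no real encoder is used here).
rng = np.random.default_rng(0)
dim = 64
offset = np.zeros(dim)
offset[0] = 6.0  # hypothetical shift that makes the two classes separable
benign = rng.normal(size=(200, dim))
flagged = rng.normal(size=(200, dim)) + offset

direction, threshold = fit_direction(benign, flagged)

# Balanced accuracy on the same synthetic data used for fitting.
accuracy = 0.5 * (
    np.mean(~is_flagged(benign, direction, threshold))
    + np.mean(is_flagged(flagged, direction, threshold))
)
print(f"balanced accuracy on synthetic data: {accuracy:.3f}")
```

Scoring a prompt this way costs one dot product per feature vector, which is one plausible reason a hidden-state detector could add little inference time. The accuracy printed here reflects only the synthetic data and says nothing about the paper's reported results.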
Taxonomy
Topics: Adversarial Robustness in Machine Learning
