Finding a Wolf in Sheep's Clothing: Combating Adversarial Text-To-Image Prompts with Text Summarization

Portia Cooper; Harshita Narnoli; Mihai Surdeanu

arXiv:2412.12212·cs.CR·December 18, 2024

Open Access

TL;DR

This paper introduces a two-layer approach that summarizes a text-to-image prompt and then classifies the summary, improving detection of adversarial prompts obfuscated by the Divide-and-Conquer Attack (F1 score up 31% over classifying unsummarized prompts).

Contribution

The study presents a novel two-layer method using text summarization and classification, along with a new dataset, to counteract obfuscation in content moderation for text-to-image prompts.

Findings

01. F1 score improved by 31% over a classifier operating on unsummarized prompts

02. Achieved up to 98% F1 score with summarized prompts

03. Pre-classification summarization enhances detection of obfuscated prompts

Abstract

Text-to-image models are vulnerable to the stepwise "Divide-and-Conquer Attack" (DACA), which uses a large language model to obfuscate inappropriate content in prompts by wrapping sensitive text in a benign narrative. To mitigate stepwise DACA, we propose a two-layer method involving text summarization followed by binary classification. We assembled the Adversarial Text-to-Image Prompt (ATTIP) dataset ($N = 940$), which contained DACA-obfuscated and non-obfuscated prompts. From the ATTIP dataset, we created two summarized versions: one generated by a small encoder model and the other by a large language model. Then, we used an encoder classifier and a GPT-4o classifier to perform content moderation on the summarized and unsummarized prompts. When compared with a classifier that operated over the unsummarized data, our method improved F1 score performance by 31%. Further, the…
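The two-layer structure described in the abstract (summarize first, then run a binary classifier on the summary) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `moderate`, `ModerationResult`, `first_sentence`, and `keyword_classifier` are hypothetical names, and the two stand-in functions are deliberately trivial placeholders for the encoder or LLM summarizer and the encoder or GPT-4o classifier used in the paper.

```python
from dataclasses import dataclass
from typing import Callable

# Layer 1: maps a (possibly obfuscated) prompt to a shorter summary.
Summarizer = Callable[[str], str]
# Layer 2: binary decision; True means "flag as inappropriate".
Classifier = Callable[[str], bool]


@dataclass(frozen=True)
class ModerationResult:
    prompt: str
    summary: str
    flagged: bool


def moderate(prompt: str, summarize: Summarizer, classify: Classifier) -> ModerationResult:
    """Summarize the prompt, then classify the summary rather than the raw prompt."""
    summary = summarize(prompt)
    return ModerationResult(prompt=prompt, summary=summary, flagged=classify(summary))


# --- Toy stand-ins (assumptions for illustration only) ---

def first_sentence(text: str) -> str:
    """Placeholder summarizer: keep only the text before the first period."""
    head, _, _ = text.partition(".")
    return head.strip()


def keyword_classifier(text: str) -> bool:
    """Placeholder classifier: flag text containing a word from a made-up blocklist."""
    blocklist = {"forbidden"}
    return any(word.strip(".,!?").lower() in blocklist for word in text.split())
```

Calling `moderate("A forbidden scene. The rest is harmless.", first_sentence, keyword_classifier)` yields a result whose `summary` is `"A forbidden scene"` and whose `flagged` field is `True`. Passing the summarizer and classifier as callables keeps the pipeline independent of any particular model, so either layer could be swapped for a real model without changing `moderate`.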

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques