On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling
Stanley Wu, Ronik Bhaskar, Anna Yoo Jeong Ha, Shawn Shan, Haitao Zheng, Ben Y. Zhao

TL;DR
This paper investigates how adversarial mislabeling attacks on vision-language models can poison training data for text-to-image AI models, demonstrating high attack success rates and highlighting vulnerabilities in current defenses.
Contribution
The paper shows that vision-language models are vulnerable to adversarial mislabeling attacks and that such attacks can poison training data for text-to-image models.
Findings
VLMs are highly vulnerable to adversarial perturbations.
Adversarial mislabeling can inject effective poison samples into training.
Attacks achieve a high success rate (>73%) in black-box scenarios.
Abstract
Today's text-to-image generative models are trained on millions of images sourced from the Internet, each paired with a detailed caption produced by Vision-Language Models (VLMs). This part of the training pipeline is critical for supplying the models with large volumes of high-quality image-caption pairs during training. However, recent work suggests that VLMs are vulnerable to stealthy adversarial attacks, where adversarial perturbations are added to images to mislead the VLMs into producing incorrect captions. In this paper, we explore the feasibility of adversarial mislabeling attacks on VLMs as a mechanism for poisoning training pipelines for text-to-image models. Our experiments demonstrate that VLMs are highly vulnerable to adversarial perturbations, allowing attackers to produce benign-looking images that are consistently miscaptioned by the VLMs. This has the effect of…
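The idea of adding a small perturbation to an image so that a model mislabels it can be illustrated on a toy problem. The sketch below is not the paper's attack and does not touch a real VLM: it assumes a hypothetical linear scorer `W` standing in for the captioning model, and the names `targeted_pgd`, `eps`, `step`, and `iters` are illustrative choices. It runs a standard targeted projected gradient descent under an L-infinity budget, so the perturbed input stays within `eps` of the original in every pixel.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def targeted_pgd(x, W, target, eps=0.05, step=0.01, iters=100):
    """Push a toy linear scorer toward `target` under an L-infinity budget.

    x      : flattened image with values in [0, 1]
    W      : (num_labels, num_pixels) weights of a toy linear scorer,
             a hypothetical stand-in for a captioning model, NOT a real VLM
    target : index of the label the attacker wants the scorer to output
    """
    x_adv = x.copy()
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        p = softmax(W @ x_adv)
        # Gradient of cross-entropy(target) with respect to the input
        # for a linear model: W^T (p - onehot).
        grad = W.T @ (p - onehot)
        # Signed gradient step that lowers the loss on the target label.
        x_adv = x_adv - step * np.sign(grad)
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

# Illustrative usage on random data (assumed sizes: 5 labels, 64 pixels).
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 64))
x = rng.uniform(size=64)
target = int(np.argmin(W @ x))  # pick the label the scorer likes least
x_adv = targeted_pgd(x, W, target)
p_before = softmax(W @ x)
p_after = softmax(W @ x_adv)
```

Comparing `p_before[target]` with `p_after[target]` shows how much the bounded perturbation shifts the toy scorer toward the attacker's label. A real attack on a VLM would replace the linear scorer with gradients through the model's image encoder (or a surrogate in the black-box case), which this sketch does not attempt.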
