Security Risk of Misalignment between Text and Image in Multi-modal Model
Xiaosen Wang, Zhijin Ge, Shaokang Wang

TL;DR
This paper shows that text and image modalities are poorly aligned in multi-modal diffusion models, which poses safety risks, and introduces PReMA, an attack that manipulates model outputs through adversarial images alone, leaving the prompt unchanged.
Contribution
The paper identifies critical alignment issues in multi-modal diffusion models and proposes PReMA, the first attack that manipulates outputs solely via adversarial images without changing prompts.
Findings
PReMA effectively manipulates generated content in various models.
Misalignment between text and image modalities poses safety risks.
PReMA demonstrates high efficacy in image inpainting and style transfer tasks.
Abstract
Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content.…
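The general idea the abstract describes, holding the prompt fixed and perturbing only the input image, can be illustrated with a toy sketch. This is not PReMA and not the authors' code: a linear function stands in for the model, the update is a generic sign-gradient step with projection onto an L-infinity ball, and every name here (`toy_model`, `perturb_image`, `prompt_bias`, `eps`, `step`) is a hypothetical chosen for illustration.

```python
# Toy illustration only: a linear "model" stands in for a multi-modal pipeline.
# The prompt contributes a fixed bias that the attacker never changes.

def toy_model(image, prompt_bias, weights):
    """Scalar output from image pixels plus a fixed prompt contribution."""
    return sum(w * x for w, x in zip(weights, image)) + prompt_bias


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def perturb_image(image, prompt_bias, weights, target, eps=0.1, step=0.02, iters=50):
    """Move the output toward `target` by editing only the image.

    Each pixel stays within `eps` of its original value and inside [0, 1].
    """
    adv = list(image)
    for _ in range(iters):
        err = toy_model(adv, prompt_bias, weights) - target
        # Gradient of err**2 with respect to pixel i is 2 * err * weights[i].
        grad = [2.0 * err * w for w in weights]
        # Sign-gradient descent step on the squared error.
        adv = [a - step * sign(g) for a, g in zip(adv, grad)]
        # Project back into the allowed perturbation range.
        adv = [
            min(max(a, max(x - eps, 0.0)), min(x + eps, 1.0))
            for a, x in zip(adv, image)
        ]
    return adv


image = [0.5, 0.5, 0.5, 0.5]       # original "image" (four pixels)
weights = [1.0, -2.0, 0.5, 1.5]    # assumed toy model parameters
prompt_bias = 0.3                  # fixed prompt contribution, never modified
target = 1.0                       # output the perturbation steers toward

clean_out = toy_model(image, prompt_bias, weights)
adv = perturb_image(image, prompt_bias, weights, target)
adv_out = toy_model(adv, prompt_bias, weights)
```

In this sketch the prompt term is never touched, yet the output moves toward the chosen target while each pixel changes by at most `eps`. A real diffusion model is neither linear nor scalar-valued, so this only conveys the shape of an image-only perturbation.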
