Self-interpreting Adversarial Images
Tingwei Zhang, Collin Zhang, John X. Morris, Eugene Bagdasarian, Vitaly Shmatikov

TL;DR
This paper introduces self-interpreting adversarial images that embed hidden instructions to manipulate visual language models' responses and style, posing new security challenges and requiring defenses.
Contribution
It presents a novel attack method using self-interpreting images as soft prompts to control model outputs and interpretations, expanding prompt injection techniques.
Findings
Self-interpreting images can effectively steer model responses and styles.
The attacks are natural-looking and coherent, making detection difficult.
Potential for misuse in spreading misinformation and spam.
Abstract
We introduce a new type of indirect, cross-modal injection attacks against visual language models that enable creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer models' outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible, yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
