One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations
Ravikumar Balakrishnan, Sanket Mendapara

TL;DR
This paper investigates typographic prompt injections in vision language models, revealing that embedding distance predicts attack success and can be used to stress test model safety and readability under perturbations.
Contribution
It provides an empirical analysis linking embedding distance to attack success and introduces a red teaming method to evaluate safety and readability.
Findings
Embedding distance strongly predicts attack success rate.
Optimizing embedding similarity improves attack success and reduces safety refusals.
The dominant attack mechanism varies with model safety filters and visual degradation.
Abstract
Typographic prompt injection exploits the ability of vision-language models (VLMs) to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain why certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ( to , ), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we…
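To make the claimed relationship concrete, the following is a minimal sketch of how one might check whether embedding distance tracks ASR across rendering conditions. It is not the authors' pipeline. The abstract does not say which distance or correlation statistic is used, so cosine distance and Pearson correlation are assumptions here, and the function names `cosine_distance` and `pearson_r` are hypothetical helpers introduced for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two embedding vectors.

    Assumption: cosine distance stands in for the paper's "embedding
    distance"; the abstract does not name the metric.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumption: Pearson's r is used only as an example statistic.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

In use, one would embed the reference prompt and each rendered image with a multimodal encoder, compute one `cosine_distance` per rendering condition (font size or transformation), measure ASR per condition, and pass the two lists to `pearson_r`. A strongly negative coefficient would be consistent with the abstract's claim that smaller embedding distance goes with higher attack success.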
