One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Ravikumar Balakrishnan; Sanket Mendapara

arXiv:2604.25102·cs.CV·April 29, 2026

TL;DR

This paper investigates typographic prompt injections in vision language models, revealing that embedding distance predicts attack success and can be used to stress test model safety and readability under perturbations.

Contribution

It provides an empirical analysis linking embedding distance to attack success and introduces a red teaming method to evaluate safety and readability.

Findings

01 Embedding distance strongly predicts attack success rate.

02 Optimizing embedding similarity improves attack success and reduces safety refusals.

03 The dominant attack mechanism varies with model safety filters and visual degradation.

Abstract

Typographic prompt injection exploits vision language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain why certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r = -0.71$ to $-0.93$, $p < 0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we…
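
The correlation reported in the abstract can be illustrated with a minimal sketch. It assumes cosine distance as the embedding distance and assumes the distance is measured between an embedding of the prompt text and an embedding of each rendered image; the abstract specifies neither. All vectors and success rates below are made-up placeholders chosen only to show the computation, not data from the paper, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder embeddings (assumption): one text embedding and three
# embeddings of progressively more degraded renderings of that text.
text_embedding = [1.0, 0.0]
image_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Placeholder attack success rates, one per rendering (not from the paper).
attack_success_rates = [0.9, 0.5, 0.1]

distances = [cosine_distance(text_embedding, e) for e in image_embeddings]
r = pearson_r(distances, attack_success_rates)
print(f"Pearson r between embedding distance and ASR: {r:.2f}")
```

With these placeholders the coefficient is negative: renderings whose embeddings sit farther from the text embedding have lower success rates, which is the direction of the relationship the abstract reports. A real analysis would also need a significance test such as the $p < 0.01$ threshold quoted above, which this sketch omits.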
