VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

Qilin Liao; Anamika Lochab; Ruqi Zhang

arXiv:2510.17759·cs.CR·October 22, 2025


Open Access · 3 Reviews

TL;DR

VERA-V introduces a probabilistic framework for discovering multimodal jailbreaks in vision-language models, enabling more diverse and stealthy adversarial attacks by modeling joint prompt distributions.

Contribution

It presents a novel variational inference approach for multimodal jailbreak discovery, surpassing existing template-based methods in effectiveness and diversity.

Findings

01 Outperforms state-of-the-art baselines on HarmBench and HADES.

02 Achieves up to 53.75% higher attack success rate than prior baselines on GPT-4o.

03 Generates diverse, stealthy adversarial prompts efficiently.

Abstract

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

1. Solid engineering: the framework integrates typography, diffusion and distractors into an end-to-end pipeline (with a LoRA attacker and judge feedback loop) and is systematically implemented.
2. Broad experimental coverage: evaluated on two datasets and four VLMs (including GPT-4o), demonstrating transferability and scalability; the results offer useful reference points.
3. Introduces a “distributed red-team” perspective: emphasizes the paradigm shift from single attacks to distributional e

Weaknesses

1. Limited methodological novelty: The core framework is a direct port of VERA. Variational inference, REINFORCE optimization, and the LoRA attacker are all lifted unchanged; the paper merely moves from single-modal to multimodal inputs. It neither argues why this extension is non-trivial nor provides any theoretical justification for the necessity or benefit of a cross-modal joint model.
2. Misleading “stealth” evaluation: Table 4 uses an “image-toxicity detection rate” as the stealth metric,

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1: The combination of typography (explicit cues), diffusion-generated images (implicit cues), and distractors (attention fragmentation) forms a coherent and novel attack strategy.
2: The proposed attacker is flexible and can be continuously optimized using feedback from the judge model.

Weaknesses

1: This work appears to offer limited technical novelty, as it can largely be regarded as an incremental extension of VERA. The overall framework of VERA-V inherits most of its structure and methodology from VERA, raising concerns about the depth of innovation.
2: The intuitive explanation, combining explicit and implicit adversarial cues with distractors to fragment attention, is reasonable and conceptually appealing. However, the paper provides little mechanistic evidence to substantiate t

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The motivation of the paper is clear. It focuses on the jailbreak vulnerability of multimodal large models, extends the text-only method to multimodal scenarios, contrasts it with the limitations of one-shot attack generation methods, and proposes feedback-driven iterative optimization.
2. VERA-V learns to generate paired adversarial prompts through interactive feedback with the target VLM. After dual-path processing of text (typesetting and rendering) and image (adversarial signal generation

Weaknesses

1. This paper conducted thorough ablation experiments, including the influence of image composition, attack models, and evaluation models. It would help to add a comparison of the individual approaches, such as typography transformation, the visual distraction strategy, and diffusion-based image generation. The analysis of the ablation experiments also needs deeper insight.
2. Table 3 presents the cross-model attack effect of prompts, which is a proof of the effectiveness of the VERA-V meth


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Explainable Artificial Intelligence (XAI)