Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui; Yige Li; Yutao Wu; Xingjun Ma; Sarah Erfani; Christopher Leckie; Hanxun Huang

arXiv:2602.01025·cs.LG·February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces UltraBreak, a novel framework for creating universal, transferable jailbreak attacks on vision-language models by leveraging vision-space transformations and semantic textual objectives, significantly improving attack transferability.

Contribution

UltraBreak is the first method to combine vision-level regularisation with semantic textual objectives to generate universal, transferable jailbreaks for vision-language models, overcoming prior transferability limitations.

Findings

01 · UltraBreak outperforms previous jailbreak methods in transferability.

02 · Semantic objectives help smooth the loss landscape, enhancing transferability.

03 · The approach generalizes across diverse models and attack targets.

Abstract

Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives.…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

## Originality

This is the first paper, that I know of, to present a method that can produce image jailbreaks that transfer between models. The idea of constraining the input space is not novel, but the semantic loss (and implementation) is new to me.

## Quality

The quality of experiments is good. The leave one out ablations also give insight into which components of the algorithm are important. I was initially skeptical about the need for attention in the semantic loss, but the results in Fi

Weaknesses

I think there should be more focus on transfer to frontier models. The bottom section of Table 1 has some good results in this area. The paper would be improved by adding results with more current frontier models.

I think some of the language in the introduction is too strong. For example, you state "We present UltraBreak, the first jailbreak framework to achieve effective cross-target universality and cross-model transferability against VLMs." This could be interpreted as meaning no prior work

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper is logically well-organized, and the motivation is clear.
2. The proposed method achieves strong performance, though the degree of novelty is uncertain.

Weaknesses

1. Figure 1 is hard to follow. I suggest adding a more explicit flow in the caption, or introducing a small algorithm box to walk through the pipeline step by step.
2. There are minor typos. For example, line 199 uses w*t,j; I believe this should be $w_{t,j}$, right?
3. It’s unclear how the method deals with the potentially spiky loss landscape induced by the total variation loss.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- I think the paper is targeting an important problem: the development of universal and transferable jailbreaks on image models. Previous work showed this was challenging.
- The method, as explained in the introduction, broadly makes sense to me and is intuitive. I like the shift from a log-likelihood to a semantic-based loss.
- The paper is clear and well written.
- The ablation studies show the components make sense and help performance.
- I liked the extra analy

Weaknesses

- I think the exposition could be tightened up in places (see questions below).
- Ideally I'd love to see a bit stronger jailbreak evaluation using something like StrongReject.
- Ideally I'd love to see some additional target models, like Claude Sonnet 4.5, GPT-5.
- "UltraBreak consistently outperforms all gradient-based baselines across target models and both test sets. One exception is ..." Please tone down the writing, e.g., with "tends to outperform".
- It woul

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Multimodal Machine Learning Applications