TL;DR
This paper introduces a new evaluation framework and a recursive rewriting strategy to improve the effectiveness of jailbreak attacks on multimodal large language models, revealing vulnerabilities in current safety mechanisms.
Contribution
It proposes a four-axis evaluation framework and a novel BSD rewriting method to enhance jailbreak success rates against MLLMs.
Findings
Balanced prompts evade filters and produce harmful outputs
BSD increases attack success rate by 67%
Current safety systems have significant vulnerabilities
Abstract
Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as "successful" are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked…
Peer Reviews
Decision·Submitted to ICLR 2026
- This paper defines and separates on-topicness and OOD-intensity, enabling structured, interpretable analysis of jailbreak mechanisms in multimodal LLMs. - This paper proposes BSD, a recursive decomposition framework that operationalizes this balance and achieves improved attack success rates.
- The contribution is primarily heuristic and engineering-focused. BSD refines existing decomposition-based attacks rather than introducing new theoretical insights into model vulnerability. - The BSD framework mainly extends prior decomposition-based attacks with additional heuristics (WBS, SDU) but lacks causal or mechanistic justification for why balancing on-topicness and OOD-intensity fundamentally increases vulnerability. - The hierarchical search and threshold settings introduce many he
+ The introduction of the four-axis framework (On-Topicness, OOD-Intensity, Harmfulness, Refusal Rate) provides a more nuanced and comprehensive standard for evaluating MLLM jailbreaks, addressing the overestimation problem in prior work. + The paper makes a good contribution by empirically identifying and formalizing the fundamental trade-off between prompt relevance and novelty, offering a clear explanation for the limitations of existing attack strategies. + The proposed BSD strategy is a wel
- The authors' findings do not significantly differ from previous work. For example, one of the main findings of this paper, that existing jailbreaks are generally ineffective, is consistent with the conclusions of Nikolić et al. (ICML’25). However, this paper fails to provide sufficient new insights or highlight the differences from previous work. - While BSD integrates descriptive and distracting images, it does not analyze specific visual features (such as content relevance) or perform ablati
- The paper is tackling an unexplored but practical limitation of existing OOD attacks with qualitative analysis of relevance-novelty trade-off. - The paper proposed a novel tree-based decomposition attack strategy to balance the trade-off. - The paper is well-written and easy to follow. - The comprehensive experimental results over closed and open-source models demonstrate the jailbreak effectiveness of proposed method with detailed qualitative analysis.
- One of my concerns is in the paper’s reliance on SBERT embeddings to quantify On-Topicness (OT) and Out-of-Distribution Intensity (OI). SBERT has known limitations in capturing subtle semantic variations—such as word order perturbations or nuanced paraphrases [R1] —which may undermine the reliability of these metrics. For instance, in Eq (2), when the short caption from MLLM is subtly but semantically altered where only minor lexical changes lead to a substantially different meaning—it is uncl
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
