Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs
Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia

TL;DR
This paper investigates why mild OOD strategies are more effective in VLM jailbreaks, revealing a trade-off between input perception and refusal triggers, and introduces a new, more effective jailbreak method based on OCR capabilities.
Contribution
It provides a theoretical explanation for the weak-OOD phenomenon and proposes a novel jailbreak approach leveraging OCR capabilities, outperforming existing methods.
Findings
Weak-OOD samples better bypass safety constraints.
Trade-off between input intent perception and model refusal triggers.
Proposed OCR-based method surpasses SOTA jailbreak techniques.
Abstract
Large Vision-Language Models (VLMs) are susceptible to jailbreak attacks: researchers have developed a variety of attack strategies that can successfully bypass the safety mechanisms of VLMs. Among these approaches, jailbreak methods based on the Out-of-Distribution (OOD) strategy have garnered widespread attention due to their simplicity and effectiveness. This paper further advances the in-depth understanding of OOD-based VLM jailbreak methods. Experimental results demonstrate that jailbreak samples generated via mild OOD strategies exhibit superior performance in circumventing the safety constraints of VLMs, a phenomenon we define as "weak-OOD". To unravel the underlying causes of this phenomenon, this study takes SI-Attack, a typical OOD-based jailbreak method, as the research object. We attribute this phenomenon to a trade-off between two dominant factors: input intent perception…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper focuses on a meaningful question—why weak OOD helps jailbreak VLMs, rather than only proposing another attack. The authors perform ablations across multiple attack methods and target models, illustrating the weak-OOD pattern with concrete quantitative trends and internal activation analyses. The mechanistic view separating intent perception and refusal triggers is supported by empirical signals across layers, and the theory connects naturally to pre-training versus alignment data distr
Although the central idea is interesting, the theoretical component still feels heuristic. The latent-space interpretation depends on assumptions about token roles and feature locality that recent work debates; the evidence is more correlational than causal. Some experimental choices raise questions about generalization. For example, the heavy reliance on shuffle-based perturbations when defining OOD magnitude, and the use of GPT-4o as a judge in evaluation, which may introduce bias or shortcu
1. Paper is easy to follow and read.
2. The intuition of seeking weak-OOD in perturbation is straightforward.
1. Layer heterogeneity is ignored in the perception measurement. Simply doing the aggregation by averaging the Layer-Wise Cosine Similarity and Layer-Wise Refusal Similarity seems a bit too brutal.
2. Especially for the jailbreak part, the different settings in creating the perturbations are not provided with ablation studies (those from appx. tab. C). The impact of the settings on the degree of intent perception and on the refusal triggering is not quantified or further analyzed. I think this i
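The aggregation this review questions can be illustrated with a small sketch. It is not the paper's implementation: the function names, the toy vectors, and the optional per-layer weights are all hypothetical, and the exact definitions of the Layer-Wise Cosine Similarity and Layer-Wise Refusal Similarity are not given on this page. The sketch only shows how a plain mean over layers can hide differences between layers that a weighted aggregate would expose.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layerwise_cosine(
    states_a: Sequence[Sequence[float]],
    states_b: Sequence[Sequence[float]],
) -> List[float]:
    """One cosine similarity per layer between two stacks of hidden states.

    Assumption: each argument holds one vector per layer, e.g. the hidden
    state of a perturbed input versus that of the clean input (or versus a
    refusal direction, for a refusal-style similarity).
    """
    if len(states_a) != len(states_b):
        raise ValueError("both inputs must cover the same number of layers")
    return [cosine(u, v) for u, v in zip(states_a, states_b)]


def aggregate(
    per_layer: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Collapse per-layer scores into one number.

    With weights=None this is the plain average over layers. The weighted
    variant is a hypothetical alternative, not something the paper proposes.
    """
    if weights is None:
        return sum(per_layer) / len(per_layer)
    if len(weights) != len(per_layer):
        raise ValueError("need exactly one weight per layer")
    return sum(w * s for w, s in zip(weights, per_layer)) / sum(weights)


# Toy example with two layers: layer 0 agrees fully, layer 1 not at all.
clean = [[1.0, 0.0], [0.0, 1.0]]
perturbed = [[1.0, 0.0], [1.0, 0.0]]
scores = layerwise_cosine(clean, perturbed)
plain_mean = aggregate(scores)
weighted_mean = aggregate(scores, weights=[3.0, 1.0])
```

In the toy example the two layers score 1.0 and 0.0, so the plain mean reports 0.5 for a model whose layers behave in opposite ways, while the weighting shifts the result toward whichever layers are emphasized.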
The paper provides a novel theoretical insight by identifying the pretrain–alignment inconsistency as the key reason why weak OOD perturbations can effectively bypass safety mechanisms. It demonstrates strong empirical rigor, with comprehensive ablations and benchmarks confirming that mild OOD manipulations outperform existing jailbreak methods. The proposed JOCR attack introduces an innovative OCR-aware mechanism that not only validates the theory but also exposes practical vulnerabilities in
1. While the paper attributes the improved jailbreak success under mild perturbations to a weak-OOD effect arising from pretrain–alignment mismatch, an alternative explanation could be that the model is simply non-robust to low-level perturbations. The experiments demonstrate that moderate perturbations yield higher attack success rates, but it is unclear whether this behavior necessarily reflects a semantic distribution shift rather than standard sensitivity to visual noise or encoding artifact
- The paper reaffirms the vulnerability of VLMs with regard to weak OOD harmful inputs and the mismatched generalization between pre-training and safety-alignment training.
- The proposed method extends prior work (i.e., FigStep) which leverages the poor generalization of safety-alignment on OCR harmful inputs, showing great jailbreak effectiveness even with simple modifications (e.g., font size variation) from the baseline (FigStep).
- The paper proposed a mechanistic quantitative framework to
- The paper’s central hypothesis that the weak-OOD phenomenon arises from the asymmetry between pre-training and alignment is not conceptually new. Similar insights have already been discussed in the previous work (https://arxiv.org/abs/2307.02483), which tackles mismatched generalization that arises when inputs are out-of-distribution for a model’s safety training data but within the scope of its broad pretraining corpus.
- This paper largely reiterates previously reported empirical findings ra
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
