Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

Lei Jiang; Zixun Zhang; Zizhou Wang; Xiaobing Sun; Zhen Li; Liangli Zhen; Xiaohua Xu

arXiv:2506.16760 · cs.CL · June 23, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces CAMO, a novel black-box attack method that exploits cross-modal reasoning in LVLMs to generate stealthy jailbreak prompts with fewer queries, revealing critical safety vulnerabilities.

Contribution

The work presents CAMO, a new cross-modal obfuscation framework that improves stealth and efficiency in jailbreak attacks on LVLMs, highlighting the need for better safety measures.

Findings

01. CAMO achieves high success rates in bypassing safety filters.

02. It requires fewer queries than existing attack methods.

03. CAMO demonstrates strong transferability across different LVLMs.

Abstract

Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs' cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

As mentioned in the summary, the authors propose a novel attack strategy that effectively obfuscates sensitive words in harmful prompts, and they demonstrate very high ASR across several models, which is a great experimental result. I am impressed by the strong empirical numbers.

Weaknesses

Although the authors check the robustness of their attack against three filters (perplexity, OCR, and the OpenAI moderation tool), I am interested in seeing its robustness against open-source guard models such as Llama Guard, WildGuard, and AegisGuard (D and P). I know the input is multimodal whereas these guard models are unimodal, so could the authors run these guard models on the text part of the input and report those experimental results? I am also interested in seeing the r

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper proposes a simple but effective jailbreak attack that nests a math problem with OCR to smuggle harmful content into the MLLM's reasoning process.
- The paper is well written and easy to follow.
- Extensive experiments on both open- and closed-source models, with ablation studies and analysis, confirm the effectiveness of the proposed jailbreak attack.

Weaknesses

- The paper’s core idea of obfuscating harmful text and guiding the model to reconstruct it is conceptually similar to prior jailbreak and text-obfuscation methods such as [R1]. The overall attack logic (hiding malicious intent through structured obfuscation and letting the model decode it internally) is not substantially new. The novelty thus feels incremental, focusing on implementation details rather than a fundamentally new attack paradigm.
- Despite the paper’s claim that CAMO significantly r

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper is easy to follow.
2. The method is a black-box attack and it is easy to implement.

Weaknesses

1. The method applies a widely used jailbreak strategy (reasoning distraction) in the cross-modal setting, so it lacks novelty.
2. The paper claims that "CAMO decomposed harmful instructions into benign-looking textual and visual clues", but in Figure 2 the visual input still appears harmful (a bomb).
3. The paper is evaluated only on AdvBench-family datasets, which does not convincingly demonstrate the validity of CAMO.
4. The baseline methods compared in the paper are not the current st

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis