FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts
Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, Xinlei He

TL;DR
This paper introduces FC-Attack, a novel method that uses auto-generated flowcharts to jailbreak multimodal large language models, revealing vulnerabilities and proposing defenses to improve safety.
Contribution
We propose FC-Attack, a new flowchart-based attack method that effectively induces harmful outputs in multimodal LLMs, highlighting safety risks and potential mitigation strategies.
Findings
FC-Attack achieves an attack success rate of up to 96% when the flowchart is supplied as an image.
Flowchart shape and font style significantly affect attack success.
AdaShield reduces jailbreak effectiveness, but at the cost of reduced model utility.
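A success rate like the one in the findings is conventionally the fraction of attempts that a judge labels as successful. The snippet below is a minimal sketch of that arithmetic only. The function name `attack_success_rate` and the example labels are hypothetical and are not taken from the paper, which may define or judge success differently.

```python
from typing import Sequence


def attack_success_rate(judged: Sequence[bool]) -> float:
    """Return the fraction of attempts labeled successful (0.0 for empty input)."""
    if not judged:
        return 0.0
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return sum(judged) / len(judged)


# Hypothetical labels for illustration: 24 of 25 attempts judged successful.
labels = [True] * 24 + [False]
print(f"ASR = {attack_success_rate(labels):.0%}")  # prints "ASR = 96%"
```

The same function can be applied with and without a defense in place to compare the two rates on an identical set of queries.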
Abstract
Multimodal Large Language Models (MLLMs) have become powerful and are widely adopted in practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, in which the model is induced to generate harmful content, creating safety risks. Although most MLLMs have undergone safety alignment, the visual modality remains vulnerable to such attacks. In our work, we discover that flowcharts containing partially harmful information can induce MLLMs to provide additional harmful details. Based on this observation, we propose FC-Attack, a jailbreak attack method based on auto-generated flowcharts. Specifically, FC-Attack first fine-tunes a pre-trained LLM on benign datasets to create a step-description generator. The generator is then used to produce step descriptions corresponding to a harmful query, which…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
