Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models
Palash Nandi, Maithili Joshi, Tanmoy Chakraborty

TL;DR
This paper investigates how prompt design factors and internal model structures influence the ability to jailbreak visual language models, revealing vulnerabilities and proposing a skip-connection framework to enhance jailbreak success rates.
Contribution
It introduces a novel skip-connection framework within VLMs that significantly increases jailbreak success, highlighting complex vulnerabilities in multimodal prompts.
Findings
VLMs distinguish benign and harmful inputs well in unimodal settings
Multimodal contexts significantly degrade model safety
Skip-connection framework boosts jailbreak success rates
Abstract
Language models are highly sensitive to prompt formulations: small changes in input can drastically alter their output. This raises a critical question: To what extent can prompt sensitivity be exploited to generate inapt content? In this paper, we investigate how discrete components of prompt design influence the generation of inappropriate content in Visual Language Models (VLMs). Specifically, we analyze the impact of three key factors on successful jailbreaks: (a) the inclusion of detailed visual information, (b) the presence of adversarial examples, and (c) the use of positively framed beginning phrases. Our findings reveal that while a VLM can reliably distinguish between benign and harmful inputs in unimodal settings (text-only or image-only), this ability significantly degrades in multimodal contexts. Each of the three factors is independently capable of triggering a jailbreak,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods
