Efficient LLM-Jailbreaking via Multimodal-LLM Jailbreak
Haoxuan Ji, Zheng Lin, Zhenxing Niu, Xinbo Gao, Gang Hua

TL;DR
This paper introduces an efficient indirect method for jailbreaking large language models by leveraging multimodal models, outperforming existing methods in success rate and efficiency through an innovative embedding conversion technique.
Contribution
The paper presents a novel multimodal-based jailbreak approach that is more efficient and effective than direct methods, with improved generalization capabilities.
Findings
Outperforms state-of-the-art jailbreak methods in success rate.
More efficient than direct LLM-jailbreak techniques.
Exhibits strong cross-class generalization.
Abstract
This paper focuses on jailbreaking attacks against large language models (LLMs), which elicit objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that target the LLM directly, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. Subsequently, we perform an efficient MLLM jailbreak and obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix to carry out the jailbreak of the target LLM. Compared to direct LLM-jailbreak methods, our indirect jailbreaking approach is more efficient, as MLLMs are more vulnerable to jailbreaking than pure LLMs. Additionally, to improve the attack success rate, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach…
Taxonomy
Topics: Digital Media Forensic Detection · Handwritten Text Recognition Techniques · Digital and Cyber Forensics
