Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun

TL;DR
This paper introduces AutoJailbreak, an automated prompt-optimization method that uses LLMs as red-teamers to jailbreak GPT-4V, reaching an attack success rate above 95% and exposing privacy risks tied to the model's face recognition ability.
Contribution
The study presents a novel automatic jailbreak technique leveraging LLMs for prompt refinement, improving efficiency and success rate over traditional methods.
Findings
AutoJailbreak achieves an attack success rate of 95.3%.
The method significantly outperforms conventional jailbreak approaches.
Efficient search with early stopping reduces optimization time.
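The early-stopping idea in the last finding can be illustrated with a generic search loop. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function `search_with_early_stopping` and its parameters `propose`, `score`, `target`, and `patience` are hypothetical names introduced here, and the candidates are plain numbers scored by a toy objective. The loop stops as soon as the best score reaches the target or stops improving, so it avoids spending evaluations (and, in an LLM setting, tokens) that would not change the outcome.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def search_with_early_stopping(
    propose: Callable[[int, Optional[T]], T],  # hypothetical: builds the next candidate
    score: Callable[[T], float],               # hypothetical: higher is better
    max_iters: int = 50,
    target: float = 0.0,
    patience: int = 5,
) -> Tuple[Optional[T], float, int]:
    """Return (best candidate, best score, number of evaluations used)."""
    best_candidate: Optional[T] = None
    best_score = float("-inf")
    stale = 0        # consecutive steps without improvement
    evaluations = 0  # how many times score() was called

    for step in range(max_iters):
        candidate = propose(step, best_candidate)
        current = score(candidate)
        evaluations += 1

        if current > best_score:
            best_candidate, best_score = candidate, current
            stale = 0
        else:
            stale += 1

        # Stop early: either the target is met or progress has stalled.
        if best_score >= target or stale >= patience:
            break

    return best_candidate, best_score, evaluations


# Toy usage: candidates are the integers 0, 1, 2, ... and the best one is 7.
best, best_score, used = search_with_early_stopping(
    propose=lambda step, _best: step,
    score=lambda x: -abs(x - 7),
    max_iters=50,
    target=0.0,
)
print(best, best_score, used)
```

In the toy run the loop uses 8 of the 50 allowed evaluations, because the target score is reached at the eighth candidate. The `patience` check covers the opposite case, where the score plateaus below the target and further iterations are unlikely to help.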
Abstract
GPT-4V has attracted considerable attention due to its extraordinary capacity for integrating and processing multimodal information. At the same time, its face recognition ability raises new safety concerns about privacy leakage. Despite researchers' efforts in safety alignment through RLHF or preprocessing filters, vulnerabilities might still be exploited. In our study, we introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. We leverage Large Language Models (LLMs) for red-teaming to refine the jailbreak prompt and employ weak-to-strong in-context learning prompts to boost efficiency. Furthermore, we present an effective search method that incorporates early stopping to minimize optimization time and token expenditure. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success…
Taxonomy
Methods: Softmax · Attention Is All You Need · Early Stopping
