TL;DR
MAJIC introduces an adaptive, iterative approach to jailbreaking large language models by dynamically combining diverse strategies, significantly improving success rates with fewer queries compared to previous static methods.
Contribution
This paper presents MAJIC, a novel Markovian framework that adaptively combines multiple disguise strategies for more effective and efficient black-box LLM jailbreaking.
Findings
Achieves over 90% success rate on GPT-4o and Gemini-2.0-flash.
Requires fewer than 15 queries on average per attack.
Outperforms existing static and rigid attack methods.
Abstract
Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks, which can elicit harmful content from the models by manipulating the input prompts. Existing black-box jailbreaking techniques primarily rely on static prompts crafted with a single, non-adaptive strategy, or employ rigid combinations of several underperforming attack methods, which limits their adaptability and generalization. To address these limitations, we propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse innovative disguise strategies. MAJIC first establishes a "Disguise Strategy Pool" by refining existing strategies and introducing several innovative approaches. To further improve the attack performance and efficiency, MAJIC formulates the sequential selection and fusion of strategies in the pool…