EvoJail: Evolutionary Diverse Jailbreak Prompt Generation for Large Language Models
Rui Tang, Kaiyu Xu, Pengsen Cheng, Hao Ren, Haizhou Wang, Shuyu Jiang

TL;DR
EvoJail is an evolutionary framework for generating diverse and adaptable jailbreak prompts for large language models, improving safety testing across model versions.
Contribution
It introduces a multi-objective evolutionary approach with instruction fusion and diversity-aware objectives to enhance jailbreak prompt diversity and adaptability.
Findings
Achieves over 93% attack success rate.
Improves diversity metrics by more than 5.6%.
Outperforms state-of-the-art methods in adaptability and diversity.
Abstract
As LLMs continue to shape real-world applications, automated jailbreak generation becomes essential to reveal safety weaknesses and guide model improvement. Existing automatic jailbreak generation methods have not yet fully considered two important aspects: adaptability to evolving safety-finetuned models, which affects their effectiveness on newer model versions, and diversity in generated prompts, which can cause narrow or repetitive attack patterns. To address these issues, we propose EvoJail, an instruction-fusion-driven evolutionary jailbreak generation framework that formalizes jailbreak prompt generation as a multi-objective black-box optimization problem and leverages the principles of evolutionary algorithms to search for jailbreak prompts that can adapt across different model versions and exhibit diverse attack patterns. Specifically, EvoJail integrates jailbreak prompt…
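To make the multi-objective framing in the abstract concrete, here is a minimal, generic sketch of Pareto (non-dominated) selection, a standard building block of multi-objective evolutionary algorithms. It is not EvoJail's implementation. The two objectives, the score values, and the function names `dominates` and `pareto_front` are assumptions chosen for illustration, and the scores are treated as opaque numbers to be maximized.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Tuple[float, ...]]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (objective_1, objective_2) scores for five candidates.
# The numbers are made up for illustration and do not come from the paper.
scores = [(0.9, 0.2), (0.6, 0.7), (0.5, 0.5), (0.3, 0.9), (0.9, 0.1)]
front = pareto_front(scores)
print(front)  # [0, 1, 3]
```

Candidates 2 and 4 drop out because another candidate matches or beats them on both objectives. The remaining three represent different trade-offs, which is why a multi-objective search keeps a set of solutions rather than a single best one.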
