The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala, Syed Bahauddin Alam, Sajedul Talukder

TL;DR
This paper introduces a 114,000-prompt dataset, instruction-fine-tuned generators, and a training-free evaluation metric (OPTIMUS) for adversarial jailbreak prompts targeting large language models, with the aim of making LLM security assessment more systematic.
Contribution
The paper presents a framework with three parts: a large compositional dataset, instruction-fine-tuned category-aware generators, and a training-free evaluator for analyzing jailbreak attacks.
Findings
Constructed 114,000 adversarial prompts across 14 attack categories.
Developed generators that reach a perplexity of 24-39, reported as better than previous models.
Proposed OPTIMUS, a metric that effectively distinguishes jailbreak quality without training.
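For readers unfamiliar with the perplexity figure quoted in the findings: perplexity is the exponential of the average negative log-likelihood per token, and lower values indicate more fluent text under the scoring model. The following is a minimal sketch, assuming per-token natural-log probabilities are already available. The function name and the sample values are illustrative and do not come from the paper, which may compute the score differently.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean(log p)) for a list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Illustrative input only: eight tokens, each assigned probability 1/30.
example_logprobs = [math.log(1 / 30)] * 8
print(round(perplexity(example_logprobs), 2))
```

A sequence in which every token receives the same probability p has perplexity 1/p, which gives a quick sanity check for any implementation.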
Abstract
Jailbreak attacks -- adversarial prompts that bypass LLM alignment through purely linguistic manipulation -- pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper addresses that gap with three contributions. (1) Large-scale compositional jailbreak dataset. We construct 114,000 adversarial prompts by applying 912 composing strategies to 125 harmful seed prompts from JailBreakV-28K. Every prompt is assigned to one of 14 cybersecurity attack categories (e.g., malware, phishing, privilege escalation) via a six-model majority-vote pipeline, and each strategy is ranked by effectiveness per category, enabling principled strategy selection grounded in concrete adversarial objectives. (2) Automated jailbreak generation. We instruction-fine-tune category-aware LLMs on…
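The dataset size and the labeling step described in the abstract can be illustrated with a short sketch. It checks that 912 strategies applied to 125 seed prompts yields 114,000 combinations, and it shows one plausible way to combine six classifier outputs by majority vote. The function name, the strict-majority threshold, and the choice to return `None` on a tie are assumptions made for illustration; the abstract does not say how the authors resolve ties.

```python
from collections import Counter

# Dataset size implied by the abstract: every strategy applied to every seed.
NUM_STRATEGIES = 912
NUM_SEEDS = 125
TOTAL_PROMPTS = NUM_STRATEGIES * NUM_SEEDS  # 114,000


def majority_vote(labels, min_votes=None):
    """Return the most common label if it reaches min_votes, else None.

    By default a strict majority is required (assumption): with six voters
    that means at least four matching votes, so a 3-3 split returns None.
    """
    if not labels:
        raise ValueError("need at least one label")
    if min_votes is None:
        min_votes = len(labels) // 2 + 1
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None


# Hypothetical votes from six classifiers for a single prompt.
votes = ["phishing", "phishing", "malware", "phishing", "phishing", "malware"]
print(TOTAL_PROMPTS, majority_vote(votes))
```

Returning `None` when no label reaches the threshold keeps ambiguous items visible, so they can be reviewed or relabeled rather than silently assigned to a category.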
