The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain; Tanzim Ahad; Md Jahangir Alam; Sai Puppala; Syed Bahauddin Alam; Sajedul Talukder

arXiv:2605.09225 · cs.CR · May 12, 2026

TL;DR

This paper introduces a large-scale dataset of adversarial jailbreak prompts targeting large language models, automated methods for generating such prompts, and a novel evaluation metric, with the aim of making LLM security assessment more systematic.

Contribution

The paper presents a framework comprising a 114,000-prompt dataset, instruction-fine-tuned generators, and a training-free evaluator (OPTIMUS) for analyzing jailbreak attacks.

Findings

01. Constructed 114,000 adversarial prompts across 14 attack categories.

02. Developed generators that reach perplexity of 24-39, outperforming previous models.

03. Proposed OPTIMUS, a training-free metric that distinguishes jailbreak quality.

Abstract

Jailbreak attacks -- adversarial prompts that bypass LLM alignment through purely linguistic manipulation -- pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper addresses that gap with three contributions. (1) Large-scale compositional jailbreak dataset. We construct 114,000 adversarial prompts by applying 912 composing strategies to 125 harmful seed prompts from JailBreakV-28K. Every prompt is assigned to one of 14 cybersecurity attack categories (e.g., malware, phishing, privilege escalation) via a six-model majority-vote pipeline, and each strategy is ranked by effectiveness per category, enabling principled strategy selection grounded in concrete adversarial objectives. (2) Automated jailbreak generation. We instruction-fine-tune category-aware LLMs on…
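The dataset arithmetic and the labeling step in contribution (1) can be illustrated with a minimal sketch. It checks that 912 strategies applied to 125 seeds give 114,000 prompts, and shows one plausible way a six-model majority vote could assign a category. The function names, the example votes, and the tie rule (returning "unresolved" when the top two labels draw) are assumptions made for illustration; the abstract does not say how the authors' pipeline handles ties.

```python
from collections import Counter


def composed_prompt_count(n_strategies: int, n_seeds: int) -> int:
    """Number of prompts when every strategy is applied to every seed."""
    return n_strategies * n_seeds


def majority_vote(votes: list[str], tie_label: str = "unresolved") -> str:
    """Return the most frequent label among the model votes.

    Assumption: an exact tie for first place yields `tie_label`.
    The source does not describe the actual tie-breaking rule.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    # A tie exists when the runner-up has as many votes as the leader.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return tie_label
    return top_label


# 912 composing strategies x 125 seed prompts, as stated in the abstract.
total_prompts = composed_prompt_count(912, 125)

# Hypothetical votes from six classifier models for a single prompt.
example_votes = [
    "phishing", "phishing", "malware",
    "phishing", "privilege escalation", "phishing",
]
example_label = majority_vote(example_votes)

print(total_prompts)   # 114000
print(example_label)   # phishing
```

With an even number of voters a 3-3 split is possible, so any real pipeline of this shape needs an explicit tie policy; flagging such prompts for review is only one option.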
